People have been “manually” extracting patterns from data for many years, but as the volume of data and the variety of sources it comes from grow, a more automatic approach is required.
The growing power of computer technology is both the cause of this increase and the solution to it: it has made it possible to collect and store far more data, and it also provides the means to process that data.
Direct hands-on data analysis has increasingly been supplemented, or even replaced entirely, by indirect, automatic data processing.
Data mining is the process of uncovering hidden patterns in data. It has been used by businesses, scientists and governments for years, for example to produce market research reports. A primary use for data mining is to analyse patterns of behaviour.
Data mining can easily be divided into stages.
Once the objective is known, that is, once it is clear which data is useful and can be interpreted, a target data set has to be assembled. Data mining can only discover patterns that already exist in the collected data, so the target data set must be large enough to contain these patterns but small enough to be mined within an acceptable time frame.
The target set then has to be cleansed, which removes sources that contain noise or missing data.
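As a rough illustration of the cleansing step, the following Python sketch drops any source that has a missing reading or a reading outside a plausible range. The source names, the sample readings and the max_abs_value limit are all hypothetical, and treating "noise" as "out of range" is only one simple assumption among many possible definitions.

```python
import math


def cleanse(sources, max_abs_value=1000.0):
    """Return only the sources with no missing and no out-of-range readings.

    `max_abs_value` is an assumed, illustrative limit for what counts as noise.
    """
    clean = {}
    for name, readings in sources.items():
        # A reading counts as missing if it is None or a floating-point NaN.
        has_missing = any(
            r is None or (isinstance(r, float) and math.isnan(r))
            for r in readings
        )
        if has_missing:
            continue
        # A reading counts as noisy if it falls outside the assumed range.
        if any(abs(r) > max_abs_value for r in readings):
            continue
        clean[name] = readings
    return clean


# Hypothetical raw data: one clean source, one with a gap, one with a spike.
raw_sources = {
    "sensor_a": [1.0, 2.0, 3.0],
    "sensor_b": [1.0, None, 3.0],
    "sensor_c": [1.0, 5000.0, 3.0],
}
clean_sources = cleanse(raw_sources)
```

Here only sensor_a survives cleansing. In practice it may be preferable to repair a source (for example by interpolating a missing value) rather than discard it, depending on how much data is available.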
The clean data is then reduced into feature vectors (summarized versions of the raw data sources), at a rate of one vector per source. The feature vectors are then split into two sets, a “training set” and a “test set”. The training set is used to “train” the data mining algorithm(s), while the test set is used to verify the accuracy of any patterns found.
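The reduction and split described above can be sketched in Python as follows. The choice of summary features (minimum, maximum and mean), the 25% test fraction and the generated sample data are assumptions made purely for illustration; a real project would choose features suited to its objective.

```python
import random
import statistics


def to_feature_vector(readings):
    """Summarize one source's raw readings as a (min, max, mean) vector."""
    return (min(readings), max(readings), statistics.mean(readings))


def split_train_test(vectors, test_fraction=0.25, seed=0):
    """Shuffle the vectors and split them into (training_set, test_set)."""
    shuffled = list(vectors)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical cleansed data: eight sources with three readings each.
clean_sources = {f"source_{i}": [float(i), i + 1.0, i + 2.0] for i in range(8)}

# One feature vector per source.
feature_vectors = [to_feature_vector(r) for r in clean_sources.values()]
training_set, test_set = split_train_test(feature_vectors)
```

With eight sources and a 25% test fraction, six vectors go to the training set and two are held back for testing. The test vectors are never shown to the algorithm during training, which is what makes them useful for checking the patterns later.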
Data mining commonly involves four classes of task:
Validation of Results
The final stage is to verify that the patterns produced by the data mining algorithms occur in the wider data set, as not all patterns found by the algorithms are necessarily valid.
If the patterns do not meet the required standards, then the preprocessing and data mining stages have to be re-evaluated. Once the patterns do meet the required standards, they can be turned into knowledge.
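A minimal sketch of this validation step, in Python, might look like the following. The "pattern" here is a deliberately simple threshold rule on a single feature, and the sample data and the REQUIRED_ACCURACY value are hypothetical; the point is only to show a pattern learned on the training set being checked against the held-back test set.

```python
def learn_threshold(training_set):
    """Learn a toy pattern: the midpoint between the two class means."""
    positives = [x for x, label in training_set if label]
    negatives = [x for x, label in training_set if not label]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2


def accuracy(threshold, dataset):
    """Fraction of examples for which 'x > threshold' matches the label."""
    correct = sum((x > threshold) == label for x, label in dataset)
    return correct / len(dataset)


# Hypothetical (feature, label) pairs.
training_set = [(1.0, False), (2.0, False), (8.0, True), (9.0, True)]
test_set = [(1.5, False), (7.5, True), (3.0, False), (6.0, True)]

REQUIRED_ACCURACY = 0.9  # assumed standard the pattern has to meet

threshold = learn_threshold(training_set)
test_accuracy = accuracy(threshold, test_set)

# If this is False, the earlier stages have to be revisited.
pattern_is_valid = test_accuracy >= REQUIRED_ACCURACY
```

In this example the learned threshold separates every test example correctly, so the pattern passes. Had the accuracy fallen below the required standard, the preprocessing and mining stages would need to be revisited before the pattern could be trusted.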