The term data mining emerged from the database marketing community sometime between the late 1970s and early 1980s. Statisticians did not understand the excitement and activity caused by this new technique, since the discovery of patterns and relationships (structure) in the data was not new to them. They had known about data mining for a long time, albeit under various names such as data fishing, snooping, and dredging, and, most disparagingly, “ransacking” the data.
Because any discovery process inherently exploits the data, producing spurious findings, statisticians did not view data mining in a positive light. Simply looking for something increases the odds that it will be found; therefore, looking for structure typically results in finding structure. All data have spurious structures, which are formed by the “forces” that make things come together, such as chance. The bigger the data, the greater the odds that spurious structures abound. Thus, an expectation of data mining is that it produces structures, both real and spurious, without distinction between them. Today, statisticians accept data mining only if it embodies the EDA paradigm. They define data mining as any process that finds unexpected structures in data and uses the EDA framework to ensure that the process explores the data, not exploits it.
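The claim that looking for structure tends to produce structure can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data are pure random noise, "bigger" is taken to mean more candidate variables, and the function names and the cutoff of 0.28 are hypothetical choices, not part of the original discussion.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def count_spurious(n_rows, n_predictors, threshold, seed=0):
    """Count noise predictors whose |correlation| with a noise response
    exceeds the threshold. Every hit is spurious by construction."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(predictor, response)) > threshold:
            hits += 1
    return hits

# Assumed cutoff: 0.28 is roughly the two-sided 5% critical value for 50 rows.
few = count_spurious(n_rows=50, n_predictors=10, threshold=0.28)
many = count_spurious(n_rows=50, n_predictors=500, threshold=0.28)
print(f"spurious hits among 10 predictors:  {few}")
print(f"spurious hits among 500 predictors: {many}")
```

Because nothing in the simulated data is related to anything else, every "noticeable" correlation is an artifact of the search, and searching more variables turns up more of them.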
Note the word “unexpected,” which suggests that the process is exploratory, rather than a confirmation that an expected structure has been found. When one finds what one expects to find, there is no longer uncertainty as to the existence of the structure. Statisticians are mindful of the inherent nature of data mining and try to make adjustments to minimize the number of spurious structures identified. In classical statistical analysis, statisticians have explicitly modified most analyses that search for interesting structure, for example by adjusting the overall alpha level/type I error rate, or by inflating the degrees of freedom.
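As one concrete example of adjusting the overall alpha level, the sketch below uses the Bonferroni correction. This particular correction is chosen here for illustration; the text does not say which adjustment is meant. The calculation also assumes the tests are independent, and the function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests,
    each run at significance level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test level that keeps the overall type I error rate near alpha."""
    return alpha / n_tests

alpha, n_tests = 0.05, 20

# Searching 20 candidate structures at 0.05 each: about a 64% chance
# of at least one spurious "finding".
unadjusted = familywise_error_rate(alpha, n_tests)

# Testing each at 0.05 / 20 = 0.0025 brings the overall rate back under 0.05.
adjusted = familywise_error_rate(bonferroni_alpha(alpha, n_tests), n_tests)

print(f"unadjusted overall error rate: {unadjusted:.3f}")
print(f"adjusted overall error rate:   {adjusted:.3f}")
```

The price of the adjustment is a stricter bar for each individual structure, which is the trade the statistician accepts in exchange for fewer spurious findings.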
In data mining, the statistician has no explicit analytical adjustments available, only the implicit adjustments effected by using the EDA paradigm itself. The following steps outline the data mining/EDA paradigm. As expected from EDA, the steps are defined by soft rules. Suppose the objective is to find structure to help make good predictions of response to a future mail campaign. The following represent the steps that need to be taken:
- Obtain the database that has similar mailings to the future mail campaign.
- Draw a sample from the database. Its size can be several multiples of 10,000, up to 100,000.
- Perform many exploratory passes of the sample. That is, do all desired calculations to determine the interesting or noticeable structure.
- Stop the calculations that are used for finding the noticeable structure.
- Count the number of noticeable structures that emerge. The structures are not final results and should not be declared significant findings.
- Seek out indicators, visual and numerical, and the indirect messages.
- React or respond to all indicators and indirect messages.
- Ask questions. Does each structure make sense by itself? Do any of the structures form natural groups? Do the groups make sense; is there consistency among the structures within a group?
- Try more techniques. Repeat the many exploratory passes with several fresh samples drawn from the database. Check for consistency across the multiple passes. If results do not behave in a similar way, there may be no structure to predict response to a future mailing, as chance may have infected your data. If results behave similarly, then assess the variability of each structure and each group.
- Choose the most stable structures and groups of structures for predicting response to a future mailing.
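The later steps above (fresh samples, a consistency check, and choosing the stable structures) can be sketched in code. This is a minimal illustration, not the author's procedure: the database is synthetic, "structure" is reduced to a simple correlation with response, and the variable names, sample sizes, and the 0.2 cutoff are all hypothetical.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def make_database(n_rows, seed=0):
    """Synthetic stand-in for a mailing database: 'recency' truly drives
    response; the five 'noise_i' fields are unrelated to it."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {"recency": rng.gauss(0, 1)}
        for i in range(5):
            row[f"noise_{i}"] = rng.gauss(0, 1)
        row["response"] = 0.5 * row["recency"] + rng.gauss(0, 1)
        rows.append(row)
    return rows

def noticeable(sample, threshold):
    """One exploratory pass: the fields whose |correlation| with
    response exceeds the threshold in this sample."""
    y = [r["response"] for r in sample]
    fields = [k for k in sample[0] if k != "response"]
    return {k for k in fields
            if abs(pearson([r[k] for r in sample], y)) > threshold}

def stable_structures(database, n_samples, sample_size, threshold, seed=1):
    """Repeat the pass on fresh samples; keep what shows up every time."""
    rng = random.Random(seed)
    passes = [noticeable(rng.sample(database, sample_size), threshold)
              for _ in range(n_samples)]
    return set.intersection(*passes), passes

database = make_database(5000)
stable, passes = stable_structures(database, n_samples=5,
                                   sample_size=500, threshold=0.2)
print("stable across all passes:", sorted(stable))
```

Requiring a structure to reappear in every fresh sample is a strict rule chosen for simplicity. A softer rule, such as appearing in most passes with low variability, would be closer in spirit to the soft rules of EDA.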