Practical Predictive Analytics

CRISP-DM and SEMMA

Cross-Industry Standard Process for Data Mining (CRISP-DM) and Sample, Explore, Modify, Model, and Assess (SEMMA) are two standard data mining methodologies that have been in use for many years. Both describe a general approach to implementing analytical projects. There is a good deal of overlap between them, even though the names of the steps differ. All of the listed steps are important to the success of a predictive analytics project, but they do not have to be followed in exact order. The methodologies are best read as an outline of best practices. It helps to be aware of the importance of each step and to understand how each one builds on the knowledge gained in the previous ones.

Although these steps are listed in sequence for reference, in practice they are iterative, and you will often cycle back to a previous step. This usually happens when you discover information that conflicts with what you found earlier.

For example, you may believe that you have finished the data preparation stage, only to discover a glitch in the data collection during the modeling stage, which means that more data prep is needed to accommodate certain conditions. One solution is to cycle back and try to remedy the data problem while also looking for ways to continue with your modeling. This often entails 'coding around' the problem by setting flags, or by maintaining different include files, versions, and so on. It always pays to code defensively when dealing with data.
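One way to picture 'coding around' a data problem with flags is the minimal Python sketch below. It is only an illustration under stated assumptions: the records, the field names (`age`, `income`), the valid age range, and the `flag_record` helper are all hypothetical and do not come from any particular dataset or library. The idea is to mark suspect rows instead of deleting them, so modeling can continue on the clean subset while the flagged rows wait for a fix upstream.

```python
import math

# Hypothetical records: the negative age and the missing income stand in
# for a glitch discovered in the data collection.
records = [
    {"id": 1, "age": 34, "income": 52000.0},
    {"id": 2, "age": -1, "income": 61000.0},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 29, "income": 48000.0},
]


def flag_record(rec):
    """Return a copy of rec with a 'suspect' flag and a list of reasons."""
    reasons = []
    age = rec.get("age")
    income = rec.get("income")

    # Assumed rule: ages outside 0..120 are treated as collection errors.
    if age is None or not (0 <= age <= 120):
        reasons.append("age_out_of_range")

    # Treat both None and NaN as a missing income value.
    if income is None or (isinstance(income, float) and math.isnan(income)):
        reasons.append("income_missing")

    flagged = dict(rec)  # copy, so the raw record is left untouched
    flagged["suspect"] = bool(reasons)
    flagged["reasons"] = reasons
    return flagged


flagged = [flag_record(r) for r in records]

# Modeling continues on the unflagged rows; flagged rows are kept for repair.
clean = [r for r in flagged if not r["suspect"]]
needs_repair = [r for r in flagged if r["suspect"]]
```

Because the raw records are never modified, the flags can be recomputed or dropped once the collection problem is fixed, without having to rebuild the rest of the pipeline.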