The following post is from a class discussion on preparing data for statistical modeling that I wrote for the Predict-420 class in the Master of Science in Predictive Analytics (MSPA) program at Northwestern University.
What does the data look like?
Regardless of whether the data is structured, semi-structured, or unstructured, you need to know what kind of data it is. Are we dealing with time series, cross-sectional, pooled, or panel data? Once you have an idea of the type of data, you have to ask the question: what will the data be used for?
We also want to get to know the data a little better by running some univariate and bivariate statistics on the raw data. This will reveal some of the relationships in the data even before we get into the cleaning and formatting process. This is not the only time to do this; it is something I recommend doing throughout the process of working with the data, all the way to when you start modeling.
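As a minimal sketch of what this first look can be in pandas (the column names and values here are made up purely for illustration):

```python
import pandas as pd

# Hypothetical raw dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "income": [42_000, 55_000, 61_000, 48_000, 250_000],
    "age":    [25, 34, 45, 29, 52],
    "spend":  [1_200, 1_900, 2_400, 1_500, 9_800],
})

# Univariate statistics: count, location, spread, and range per variable.
univariate = df.describe()

# Bivariate statistics: pairwise correlations hint at relationships
# worth investigating before any cleaning or formatting.
bivariate = df.corr()

print(univariate)
print(bivariate)
```

Even this quick pass can surface surprises (a suspicious maximum, an unexpectedly strong correlation) that shape the cleaning work to come.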
What is the business problem we will solve?
Understanding the business problem you are trying to solve is crucial. By knowing what the business problem is, you can make sure that your model supports it and that you optimize for the right metric. Understanding the business problem also helps you pick the right variables and, later, decide which dummy variables, indices, categories, etc. you need to create and how to go about imputation.
I’ve heard arguments that you should capture all of the data, because you might need it later for secondary or even tertiary insights that are very often not known at the time the data is captured. I think this still falls within the “know the business problem” rule: if you expect later usage of the data, you should find a cheap storage medium like Amazon S3 Glacier, where you don’t have to do any data munging but can simply store the data in archive form until the conditions are ripe to reuse it. As for modeling data, I have to assume there’s a business case more appropriate than just “perhaps one day we might need it”.
What are your intentions with the data?
This is an important question to ask yourself, and it leads you to consider how you will approach randomization and sampling for training and testing data. This is also the time to determine whether you will use a single-factor or multiple-factor design, and if you use a multi-factor design, you need to concern yourself with the implications of nested and crossed factors. You should also have an idea of what to expect and how to handle potential confounders and control variables. Finally, what will be the best approach to sample design: simple random sampling, stratification, or clustering?
These are questions you should try to answer as best you can, so that you head into the next step of cleaning up your data knowing what you want to accomplish with that process.
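As one concrete example of the sampling decision, here is a minimal sketch of a stratified train/test split in pandas (the dataset and the 80/20 class ratio are made up for illustration):

```python
import pandas as pd

# Hypothetical dataset with an imbalanced binary target.
df = pd.DataFrame({
    "x1": range(100),
    "target": [0] * 80 + [1] * 20,   # 80/20 class ratio
})

# Stratified sampling: draw 25% from each class so the test set
# preserves the population's class ratio, then train on the rest.
test = df.groupby("target", group_keys=False).sample(frac=0.25, random_state=42)
train = df.drop(test.index)

print(len(train), len(test), test["target"].mean())
```

With a simple random sample instead, a small minority class could be over- or under-represented in the test set just by chance; stratification removes that source of variance.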
Cleaning, coding and imputing.
From experience, the upfront cleaning and preparation of data, along with post-model checks for multicollinearity, outliers and influential points, missing data, truncation, and censoring, fits the 80/20 rule perfectly. These are important steps, and they include many different processes where best guesses are often used. That is why it’s so important to first understand the data and its use before attempting to clean, code, and impute it.
Dealing with missing data often requires you to make decisions about what to do with those “N/A” values. Do you replace them or leave them out, and what happens to your statistical measures if you do? What about large outliers and influential points? Do you replace or remove them, or perhaps use binning to smooth out their effects? At every step, you have to ask how that will affect your model creation. It is, however, important to know whether you should concern yourself with missing values at all. Some predictive modeling techniques can handle missing values, such as:
- Decision Trees (CHAID, CART, C4.5)
- Gradient Boosting
- Random Forests
But if you are using one of the following predictive modeling techniques, you have to deal with missing values ahead of time, since these methods generally require complete inputs:
- Regression (Linear, Logistic, GLM, etc.)
- Neural Networks
- Support Vector Machines
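For those techniques, something as simple as median imputation plus a missingness indicator is a common starting point. A minimal sketch in pandas (column name and values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical feature with missing values.
df = pd.DataFrame({"income": [40_000, np.nan, 60_000, 52_000, np.nan]})

# Keep an indicator column so the model can still "see" that a value
# was originally missing, then fill the gaps with the median.
df["income_missing"] = df["income"].isna().astype(int)
df["income"] = df["income"].fillna(df["income"].median())

print(df)
```

The median is just one defensible best guess; as the text notes, the right imputation strategy depends on understanding the data and the business problem first.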
Most likely, you will not just be missing data; you will also have to decide whether you really need all of the data. Automated variable selection methods such as adjusted R-squared, Mallows’ Cp, and forward, backward, and stepwise selection can help you cut down on the number of variables to include, but they won’t necessarily help you with the next step, which is to create bins and dummy variables. This process leads directly into the next, our initial EDA, where we check normality and investigate the residuals for the data.
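The bin-and-dummy step can be sketched in a few lines of pandas (the columns and cut points below are made-up examples):

```python
import pandas as pd

# Hypothetical categorical and continuous features.
df = pd.DataFrame({
    "region": ["north", "south", "south", "west"],
    "age":    [22, 37, 58, 45],
})

# Dummy variables: drop_first avoids the dummy-variable trap
# (perfect collinearity with the intercept in a regression).
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)

# Binning: collapse a continuous variable into ordered categories.
df["age_bin"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                       labels=["young", "middle", "senior"])

df = pd.concat([df, dummies], axis=1)
print(df)
```

Note that the choice of cut points is itself a modeling decision, which is why knowing the business problem matters here too.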
But what do you do if, after imputing values and limiting the factors included, you still see problems with the data, such as outliers? There is no straightforward answer, but there are some general guidelines to follow. You might decide that all you need to do is limit the effect of the outliers on the data. There are several ways to do that:
- Increase the size of the dataset, which will reduce the effect of outliers
- Increase the number of variables (factors) used in the model, which will also reduce the effect of the outliers
Getting more data is not always an option, however, so we may have to resort to some additional methods, which include:
Building multiple models that address the different clusters of data. For example, you have one model for lower- to upper-middle-income people ($45k – $250k), a model for the affluent and wealthy ($251k – $1 million), and a third model for the super wealthy. As you probably realize from this example, even data at the $10 million income level will be skewed by the billionaire class. That is why looking at the data distribution is needed to help you identify the appropriate clusters to use for different models.
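Segmenting the data along those lines is straightforward in pandas; here is a minimal sketch, where the band boundaries simply echo the illustrative figures above (in practice they would come from inspecting the income distribution):

```python
import pandas as pd

# Hypothetical incomes spanning several orders of magnitude.
df = pd.DataFrame({"income": [48_000, 95_000, 260_000, 700_000, 12_000_000]})

# Illustrative segment boundaries, mirroring the bands in the text.
bands = pd.cut(
    df["income"],
    bins=[0, 250_000, 1_000_000, float("inf")],
    labels=["mass_market", "affluent", "super_wealthy"],
)

# One model per segment: each slice of the data gets its own fit.
segments = {label: group for label, group in df.groupby(bands, observed=False)}
print({k: len(v) for k, v in segments.items()})
```

Each entry in `segments` would then feed its own model, so the billionaire tail no longer distorts the fit for everyone else.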
You can also swap the models you planned to use for models that deal well with outliers, such as:
- Decision Trees
- Random Forests
- Gradient Boosting
You have to know the implications of these models, since some of them bring additional complexity and complications in their interpretation.
Finally, you can also transform the outliers using the following methods:
- Truncate (cap) values at a chosen threshold
- Log Transforms
- Binning Data
- Combining several techniques (e.g., log followed by binning)
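The transforms above can be sketched in a few lines (the series and the 95th-percentile cap are made-up choices for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical skewed variable with one extreme outlier.
x = pd.Series([12, 15, 14, 18, 21, 19, 16, 900])

# Truncate: cap values at the 95th percentile.
truncated = x.clip(upper=x.quantile(0.95))

# Log transform: compresses the right tail; log1p also handles zeros.
logged = np.log1p(x)

# Combine techniques: log first, then bin into quartiles.
log_binned = pd.qcut(logged, q=4, labels=["q1", "q2", "q3", "q4"])

print(truncated.max(), logged.max(), log_binned.iloc[-1])
```

Which transform is appropriate depends, once again, on the distribution of the data and on how interpretable the transformed variable needs to be in the final model.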