
Preparing your data for modeling

The following post is based on a class discussion about preparing data for statistical modeling that I wrote for the Predict-420 class in the Master of Science in Predictive Analytics (MSPA) program at Northwestern University.

What does the data look like?

Regardless of whether the data is structured, semi-structured, or unstructured, you need to know what kind of data it is. Are we dealing with time series, cross-sectional, pooled, or panel data? Once you have an idea of the type of data, you have to ask the question: what will the data be used for?

We also want to get to know the data a little better by running some univariate and bivariate statistics on the raw data. This will show us some of the relationships in the data even before we get into the cleaning and formatting process. This is not the only time to do this; it is something I recommend doing throughout the process of working with the data, all the way to when you start modeling.
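Here is a minimal sketch of that kind of first pass, assuming the raw data has already been loaded into a pandas DataFrame (the file and column names below are just placeholders):

```python
import pandas as pd

# Hypothetical raw dataset; replace the path and column names with your own.
df = pd.read_csv("raw_data.csv")

# Univariate statistics: each column's distribution on its own.
print(df.describe(include="all"))                      # counts, means, quartiles, top categories
print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column

# Bivariate statistics: pairwise relationships between numeric columns.
print(df.corr(numeric_only=True))                      # Pearson correlation matrix

# A quick numeric-by-categorical comparison ('income' and 'segment' are placeholders).
print(df.groupby("segment")["income"].agg(["mean", "median", "std"]))
```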

What is the business problem we will solve?

Understanding the business problem you are trying to solve is crucial. Knowing the business problem lets you make sure your model supports it and that you optimize for the right metric. It also helps you pick the right variables and, later on, decide which dummy variables, indices, categories, and so on you need to create, and how to go about imputation.

I’ve heard arguments that you should capture all of the data, because you might need it later for secondary or even tertiary insights that are very often not known at the time the data is captured. I think this still falls within the “know the business problem” rule: if you expect later usage of the data, you should find a cheap storage medium like Amazon S3 Glacier, where you don’t have to do any data munging but can simply store the data in archive form until the conditions are ripe to reuse it. As for modeling data, I have to assume there’s a business case more appropriate than just “perhaps one day we might need it”.
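If you do decide to park the raw data for later, the archive step itself can be as simple as a single upload with the Glacier storage class. A rough sketch using boto3 (the bucket, key, and file names are hypothetical, and credentials are assumed to be configured the usual AWS way):

```python
import boto3

s3 = boto3.client("s3")

# Push a compressed raw extract straight into an archive tier:
# cheap to keep, slower and more expensive to retrieve.
with open("raw_events_2016.csv.gz", "rb") as f:
    s3.put_object(
        Bucket="my-archive-bucket",                      # hypothetical bucket name
        Key="raw/events/2016/raw_events_2016.csv.gz",
        Body=f,
        StorageClass="GLACIER",
    )
```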

What are your intentions with the data?

This is an important question to ask yourself, and it leads you to think about how you will approach randomization and sampling for training and testing data. This is also the time to determine whether you will use a single-factor or multi-factor design, and if you use a multi-factor design, you need to concern yourself with the implications of nested and crossed factors. You should also have an idea of what to expect and how to handle potential confounders and control variables. Finally, what will be the best approach to sample design: a simple random sample, stratification, or clustering?
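To make the sampling question concrete, here is a small sketch using scikit-learn's train_test_split, first as a simple random sample and then stratified on the target (the dataset and the 'target' column name are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("raw_data.csv")          # hypothetical dataset
X = df.drop(columns=["target"])
y = df["target"]

# Simple random sample: hold out 30% for testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Stratified sample: keep the class proportions of y the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```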

These are questions you should try to answer as best you can, so that you head into the next step of cleaning up your data knowing what you want to accomplish with that process.

Cleaning, coding and imputing.

From experience, the upfront cleaning and preparation of data, along with post-model checks for multicollinearity, outliers and influential points, missing data, truncation, and censoring, fits the 80/20 rule perfectly. These are important steps, and they include many different processes where best guesses are often used. That is why it’s so important to first understand the data and its use before attempting to clean, code, and impute it.

Dealing with missing data often requires you to make decisions about what will be done with those “N/A” values. Do you replace them or leave them out, and what happens to your statistical measures if you do? What about large outliers and influential points? Do you replace or remove them, or perhaps use binning to smooth out their effects? At every step, you have to ask how that choice will affect your model. It is, however, important to know whether you should concern yourself with missing values at all. Some predictive modeling techniques can handle missing values, such as:

  • Decision Trees (CHAID, CART, C4.5)
  • Gradient Boosting
  • Random Forests

But if you are using one of the following predictive modeling techniques, you have to deal with missing values ahead of time, since these techniques don’t handle them well (a minimal imputation sketch follows the list below).

  • Regression (Linear, Logistic, GLM, etc)
  • Neural Networks
  • Support Vector Machines
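As promised above, here is a minimal imputation sketch, assuming a pandas DataFrame with a mix of numeric and categorical columns (the file name is a placeholder, and median / "missing" category are just one reasonable pair of choices):

```python
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.read_csv("raw_data.csv")          # hypothetical dataset

# Numeric columns: replace missing values with the column median.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])

# Categorical columns: treat "missing" as its own explicit category.
cat_cols = df.select_dtypes(include="object").columns
df[cat_cols] = df[cat_cols].fillna("missing")
```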

Most likely, you will not just be missing data; you will also have to decide whether you really need all of the data. Automated variable selection methods such as adjusted R-squared, Mallows's Cp, and forward, backward, and stepwise selection can help you cut down on the number of variables to include, but they won’t necessarily help you with the next step, which is to create bins and dummy variables. This process leads directly into the next, our initial EDA, where we check normality and investigate the residuals of the data.
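For the bins and dummy variables themselves, pandas covers both in a couple of lines. A sketch, with 'region' and 'age' standing in for whatever categorical and continuous columns you actually have:

```python
import pandas as pd

df = pd.read_csv("raw_data.csv")          # hypothetical dataset

# Dummy variables: expand a categorical column into 0/1 indicator columns.
# drop_first=True avoids the dummy-variable trap in regression models.
df = pd.get_dummies(df, columns=["region"], drop_first=True)

# Binning: collapse a continuous column into a few ordered buckets.
df["age_bin"] = pd.cut(
    df["age"],
    bins=[0, 25, 45, 65, 120],
    labels=["young", "adult", "middle", "senior"],
)
```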

But what do you do if, after imputing values and limiting the factors included, you still see problems with the data such as outliers? There is really no straightforward answer, just some general guidelines to follow. You might decide that all you need to do is limit the effect of the outliers on the data. There are several ways to do that, such as:

  • Increase the size of the dataset, which will reduce the effect of outliers
  • Increase the number of variables (factors) used in the model, which will also reduce the effect of the outliers

Having more data is, however, not always an option, so we have to resort to some additional methods, which include:

Building multiple models that address different clusters of the data. For example, you might have a model for lower- to upper-middle-income people ($45k – $250k), a model for the affluent and wealthy ($251k to $1 million), and a third model for the super wealthy. As you probably realize from this example, even data at the $10 million income level will be skewed by the billionaire class. That is why looking at the data distribution is needed to help you identify the appropriate clusters to use for different models.
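A sketch of that segment-per-model idea, using the income bands above (the dataset, feature names, and the choice of a plain linear regression are all placeholders):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("customers.csv")         # hypothetical dataset with 'income', 'target', and features

# Cut the population into the income bands described above (rows below $45k fall outside these bins).
bands = pd.cut(
    df["income"],
    bins=[45_000, 250_000, 1_000_000, float("inf")],
    labels=["middle", "affluent", "super_wealthy"],
)

# Fit one model per segment so the extreme tail does not dominate the rest.
models = {}
for name, segment in df.groupby(bands):
    X = segment[["age", "tenure"]]        # placeholder feature names
    y = segment["target"]
    models[name] = LinearRegression().fit(X, y)
```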

You can also swap the models you planned to use for models that deal well with outliers, such as:

  • Decision Trees
  • Random Forests
  • Gradient Boosting

You have to know the implications of these models, since some of them bring other complications and added complexity in their interpretation.

Finally, you can also transform the outliers using the following methods (a short sketch combining a few of them follows the list):

  • Truncate (cap) values at a certain threshold
  • Log Transforms
  • Standardization
  • Normalization
  • Binning Data
  • Combine several techniques (e.g. “log followed by binning”)
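Here is that short sketch, applying the transformations to a single skewed column, including the combined “log followed by binning” case (the file and column names are placeholders):

```python
import numpy as np
import pandas as pd

df = pd.read_csv("raw_data.csv")          # hypothetical dataset with a skewed 'income' column

# Truncate (cap) values at the 99th percentile.
cap = df["income"].quantile(0.99)
df["income_capped"] = df["income"].clip(upper=cap)

# Log transform (log1p handles zero values safely).
df["income_log"] = np.log1p(df["income"])

# Standardization: zero mean, unit variance.
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Normalization: rescale into the [0, 1] range.
df["income_norm"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())

# Binning, and a combination: log transform followed by binning into quartiles.
df["income_bin"] = pd.qcut(df["income"], q=4, labels=False)
df["income_log_bin"] = pd.qcut(df["income_log"], q=4, labels=False)
```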

Useful Python Snippets by Becky Sweger

This is worth saving here for future use.

I’ll probably branch and add some of my own later (when I don’t have 1,000 other things to do :-))

Beyond search and recommendation engines

(Photo: Johann Beukes on a Triumph motorcycle)

When searching for information, be it a book recommendation, what the next hot tech product will be, or the best restaurant in your neighborhood, the traditional way was to start with a search engine like Google, Bing, or Yahoo. With the proliferation of smartphones, and with them dedicated apps that performed mostly single-purpose recommendations, the paradigm still remained the same. As a consumer, you had to trust that when you asked Google.com or Yelp about the best Indian restaurant near you, the results would be unbiased and fit your expectations.

Though we can assume that for the most part this is true, the one important thing to realize is that any recommendation should be taken with a grain of salt. Why, you may ask? Because of the profit motive. In the era of Google AdSense and syndicated ad networks that track you across the web to present more targeted advertising, identifying alternatives to the traditional avenues for finding information is important.

This is where social media became a big player. I’m not talking about Facebook’s or Twitter’s ad platforms, but rather about the connections you form in those social networks. The study of these social networks focuses on how the social structure of relationships around a person, group, or organization affects beliefs or behaviors. Translated, this means shared attributes and levels of trust that cannot easily be formed by companies through branding. Perhaps Apple has shown great promise in forming a cult around its products, one that reflects a high level of trust and shared attributes among its followers. This is, however, not an easy feat, and most companies will never attain such cult status for their product offerings.

So this leaves a company with two options: either bow to the always-changing guidelines and rules around search engine optimization (SEO), or focus efforts on getting your products into the social networks of target customers. Perhaps the best-known term for the second option is getting your product to go “viral”, which indicates it is spreading like a disease through social networks. This by itself has become a huge area of investment for companies as well, working to ensure their products are liked, shared, re-tweeted, and so on.

So then the question is: what do you need to enter this area of product promotion that does not rely on a Google algorithm to ensure you rank higher on a search page than the competition? The answer, as I see it: social and behavioral psychology to research and create models that can be used to test hypotheses, using tools such as statistical modeling, predictive analytics, and big data. A few years ago I came to this realization and found myself going back to school to study psychology with an emphasis on the social, behavioral, and cognitive areas that I can use to analyze and create models to both test hypotheses and make predictions. Currently, I’m working on the second part of my toolset, which is being able to act on the theoretical aspects of behavior modeling by studying predictive analytics at Northwestern University. The Master of Science in Predictive Analytics (MSPA) program is one of only a few applied statistics programs available from top-tier schools in the USA, and with the predicted shortage of data scientists in the future, it is perhaps one of the best investments I have made in myself in a long time.

With this blog, I will share various topics that I find interesting, as well as discoveries I make while working through the MSPA program and through professional experience in the areas of big data, data science, and software engineering. Naturally, I have some personal interests that I engage in and that are part of my life, such as martial arts, Zen Buddhism, and healthy living through exercise and a cruelty-free diet. Thank you for reading all the way to the end of this first post on my new blog site, and feel free to comment and connect with me on the various social sites.