As it turns out, it's not all that difficult to calculate measures of our past performance. But what if we could predict the future? Then we could outsmart the competition by making the products that consumers really want (perhaps before they know they want them), recognizing viruses before they become "known" viruses, knowing whether a consumer is going to buy our product before we give them the sales pitch, and much more. This is the purpose of predictive tools and techniques.

Predictive data analysis requires relatively complex statistical formulas that use historical data to make predictions about the relationships between sets of variables. There are many forms of, and formulas for, these analyses. We'll review the main ones here.

### Detecting Categories

Detecting categories is the process of "clustering" records (remember: records = rows = instances) in a database into groups of related records. For example, consider the number set: 1,1,1,1,2,2,5,5,5,5,6,6,6,6,9,9,9,9,10. How many clusters of related numbers are there? Hopefully you said three: the values around 1 and 2, the values around 5 and 6, and the values around 9 and 10. Let's assume those numbers refer to the number of products that each of a set of customers purchased during their last visit to our website. What if we have more data on each customer, such as the number of days since they made those purchases? That would require us to plot the values of those two variables for each customer, as in the image below:

What if we have three variables for each customer? See image below:

What if we have 35 variables for each customer? This is quite possible. However, there's no practical way to visualize clusters in 35 dimensions. Yet statistical algorithms can represent all 35 dimensions in your computer's memory and summarize your customers into as few as two basic clusters, or as many as the data supports. Clustering analyses will not only tell you how many clusters were found, but also the primary characteristics (attribute:value pairs) of each cluster. This will allow you to group your customers into segments and create unique strategies for each segment.
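To make the idea concrete, here is a minimal sketch of one common clustering approach, k-means, applied to the number set from the example above. It is an illustration under stated assumptions rather than the algorithm any particular tool uses: the function name `kmeans` and the hand-picked starting centers are hypothetical, and real tools usually choose the starting points (and often the number of clusters) for you. Because distances are computed with `math.dist`, the same code works unchanged on points with 2, 3, or 35 values each.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means: alternate between assigning points and moving centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so we are done
            break
        centers = new_centers
    return centers, clusters

# Products purchased per customer (the number set from the text).
purchases = [1, 1, 1, 1, 2, 2, 5, 5, 5, 5, 6, 6, 6, 6, 9, 9, 9, 9, 10]
points = [(float(n),) for n in purchases]  # one value per customer

# Assumption: three starting centers at the low end, middle, and high end.
start = [(1.0,), (5.5,), (10.0,)]
centers, clusters = kmeans(points, start)

for center, members in zip(centers, clusters):
    print(f"center={center[0]:.2f} size={len(members)}")
# center=1.33 size=6
# center=5.50 size=8
# center=9.20 size=5
```

Each center is a compact description of its cluster, which is the simplest version of the "primary characteristics" a clustering tool reports for each segment.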

### Analyzing Key Influencers

**Key influencer analysis** is one of the most common techniques and involves measuring the correlation (i.e., the predictive ability) of a set of independent (a.k.a. "x") variables on typically one dependent (a.k.a. "y") variable. This is a great way to find out, for example, what characteristics of potential customers are related to their level of repeat purchases in your store. For example, as customers have more education, they may be more likely to make a purchase (represented by the line of best fit through the scatterplot below). However, it is very important to be wary of assuming causality even though you may find statistically significant results. For example, it may not be a customer's education that causes them to make purchases; but rather, their education led to greater income which led to purchases.

There are several statistical formulas used to make these kinds of predictions. The image below depicts a regression analysis. Other formulas include Naive Bayes, Decision Trees, Neural Networks, and more.
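As a rough illustration of the regression case, the sketch below fits a line of best fit with ordinary least squares and reports the correlation between one "x" variable and one "y" variable. The data is entirely hypothetical (made-up years of education and repeat-purchase counts), and `fit_line` is a name chosen for this example; a real key influencer analysis would typically consider many x variables at once.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / (sxx * syy) ** 0.5  # correlation: strength of the linear relationship
    return slope, intercept, r

# Hypothetical customers: years of education (x) and repeat purchases (y).
education_years = [10, 12, 12, 14, 16, 16, 18, 20]
repeat_purchases = [1, 2, 1, 3, 4, 3, 5, 6]

slope, intercept, r = fit_line(education_years, repeat_purchases)
print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.2f}")
```

A positive slope with a correlation near 1 says the two variables move together. As the text cautions, it says nothing about which one (if either) causes the other.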

### Forecasting

Forecasting is the process of predicting future values at regular time intervals based on known, measured values from earlier intervals of the same length. As a result, forecasting always has a standard time period and is charted over time. Sales revenues, profit, costs, and market demand are among the most common measures forecasted over time (i.e., as a time series). The ARMA (autoregressive moving average) and ARIMA (autoregressive integrated moving average) models are among the most common statistical formulas used for forecasting.
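A full ARIMA implementation is beyond a short example, but a deliberately simplified special case shows the moving parts. The sketch below differences the series once (the "integrated" part), fits a one-lag autoregression to the period-to-period changes by least squares (the "autoregressive" part), and adds the predicted changes back onto the last known value. It has no moving-average term, the sales figures are hypothetical, and the function names are invented for this example.

```python
def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]."""
    prev, curr = series[:-1], series[1:]
    n = len(prev)
    mean_p = sum(prev) / n
    mean_c = sum(curr) / n
    sxx = sum((p - mean_p) ** 2 for p in prev)
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    phi = sxy / sxx if sxx else 0.0  # constant series: fall back to the mean
    c = mean_c - phi * mean_p
    return c, phi

def forecast_arima_110(series, steps):
    """Difference once, fit AR(1) on the changes, then integrate back."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    c, phi = fit_ar1(diffs)
    level, change = series[-1], diffs[-1]
    forecast = []
    for _ in range(steps):
        change = c + phi * change  # predicted next change
        level += change            # undo the differencing
        forecast.append(level)
    return forecast

# Hypothetical monthly sales revenue (in thousands), one value per month.
monthly_sales = [100, 104, 109, 115, 118, 124, 131, 135, 142, 150]
next_quarter = forecast_arima_110(monthly_sales, steps=3)
print([round(v, 1) for v in next_quarter])
```

Note that the input is just a list of values at equal intervals. That is the "standard time period" requirement in practice: the model has no notion of dates, only of position in the sequence.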

### Market Basket Analysis

Market basket analysis is a popular analysis for predicting consumer shopping patterns. In particular, it involves examining the products that have been grouped in the past by consumers (e.g., in a "shopping basket") and using that information to predict related items that each customer may want to purchase based on the shopping baskets of other customers who bought similar products (see image below). The statistical technique used to perform market basket analysis is called "association analysis." If you've ever visited amazon.com, then you've seen market basket analysis as new products are always suggested based on the product you are viewing.
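The core of association analysis can be sketched with three standard measures for a rule such as "customers who buy A also buy B": support (how often A and B appear together), confidence (how often B appears in baskets that contain A), and lift (how much more often than chance). The example below is a minimal sketch limited to pairs of items; the baskets are hypothetical, and `pair_rules` and its `min_support` threshold are names chosen for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.2):
    """Return (lhs, rhs, support, confidence, lift) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]     # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # vs. P(rhs) overall
            rules.append((lhs, rhs, support, confidence, lift))
    return sorted(rules, key=lambda r: r[4], reverse=True)

# Hypothetical past shopping baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

for lhs, rhs, support, confidence, lift in pair_rules(baskets):
    print(f"{lhs} -> {rhs}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

A recommendation feature would then suggest the right-hand item of the strongest rules whose left-hand item is already in the customer's basket or on the page being viewed.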

### Other Tools

The statistical formulas used in the analyses above can also be used to improve other important steps in data analysis. For example, if you have missing consumer data in your analytical database, many statistical formulas will automatically ignore all of that consumer's data because they require complete data to work at all. Let's say that only 8 percent of your customers have completed their entire online profile. It would be a shame to have to ignore the other 92 percent of your customers.

So what options do you have? Well, you can start paying for external databases which might be able to fill in the gaps. However, those databases often have the exact same info you have. Another option is to fill in the missing values of each customer with the average of all other customers. While this will allow you to use more of your data, it will likely weaken the relationships you measure, because every filled-in customer looks identical on that attribute.

More recently, a popular technique has been to use the same statistical analysis used in key influencer analysis (e.g., regression) to predict the most likely value of the missing data based on the actual values of all other attributes of the record. For example, if you know that a customer is male, age 20, not a home owner, and works part time, your statistical regression model is likely to also predict that this person has a partial college education. While it is definitely possible that this prediction is wrong, it is much more likely to be accurate than simply using the most common value for education found in the data.
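The difference between filling in an average and filling in a regression prediction can be shown with a small sketch. It is simplified to a single predictor, whereas the approach described above would use all of a record's known attributes. The customer data is hypothetical, and the names `fit_line`, `impute_with_regression`, `age`, and `income_k` are invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def impute_with_regression(rows, predictor, target):
    """Fill missing `target` values using a line fitted on complete rows."""
    known = [r for r in rows if r[target] is not None]
    slope, intercept = fit_line([r[predictor] for r in known],
                                [r[target] for r in known])
    filled = []
    for r in rows:
        r = dict(r)  # copy so the original records stay untouched
        if r[target] is None:
            r[target] = intercept + slope * r[predictor]
        filled.append(r)
    return filled

# Hypothetical customers; one has not reported an income (in thousands).
customers = [
    {"age": 25, "income_k": 40},
    {"age": 35, "income_k": 60},
    {"age": 45, "income_k": 80},
    {"age": 55, "income_k": 100},
    {"age": 30, "income_k": None},
]

known_incomes = [c["income_k"] for c in customers if c["income_k"] is not None]
mean_fill = sum(known_incomes) / len(known_incomes)
regression_fill = impute_with_regression(customers, "age", "income_k")[-1]["income_k"]

print(mean_fill)        # 70.0
print(regression_fill)  # 50.0
```

In this made-up data, the average would give the 30-year-old the same income as a typical 40-year-old, while the regression uses what is known about the customer to place them between the 25- and 35-year-olds.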

Another useful technique is to use the clustering algorithms found in category detection tools to identify records that are outliers. For example, your customers may have an average income of $75,000 per year. Therefore, a customer making $200,000 per year may appear to be an outlier. Removing genuine outliers from the data often improves your predictive power. However, a clustering algorithm would examine all of the other attributes of this apparent outlier in concert and find that because that customer has a graduate degree and 30 years of full-time work experience, they are well within the normal range of customers. Similarly, another customer who earns the exact average of $75,000 would be identified as an outlier if they are 16 years old.
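The sketch below illustrates the "in concert" idea with a simpler stand-in for a full clustering algorithm: each attribute is standardized, and each customer is scored by the distance to their nearest neighbor across all attributes together. A customer with no similar customers nearby is a candidate outlier. The data is hypothetical, only two attributes are used, and the function names are invented for this example.

```python
import math

def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stds))
            for row in rows]

def isolation_scores(rows):
    """Distance from each row to its nearest neighbor, using all attributes."""
    z = standardize(rows)
    return [min(math.dist(p, q) for j, q in enumerate(z) if j != i)
            for i, p in enumerate(z)]

# Hypothetical customers as (age, income in thousands).
customers = [
    (22, 30), (24, 35), (26, 38),      # younger, lower income
    (40, 75), (42, 78), (44, 72),      # mid-career, mid income
    (58, 195), (60, 200), (62, 205),   # long careers, high income
    (16, 75),                          # mid income, but only 16 years old
]

scores = isolation_scores(customers)
most_isolated = max(range(len(scores)), key=scores.__getitem__)

# For comparison: the most extreme customer when looking at income alone.
income_z = [abs(row[1]) for row in standardize(customers)]
most_extreme_income = max(range(len(income_z)), key=income_z.__getitem__)

print(customers[most_isolated])        # (16, 75)
print(customers[most_extreme_income])  # (62, 205)
```

Judged on income alone, the highest earner looks like the outlier. Judged on age and income together, the high earners sit close to one another, and the 16-year-old with a mid-range income is the record that has no similar customers nearby.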

In summary, statistical formulas and techniques have become mainstream in today's BI stack. Creating new and useful ways to integrate statistical prediction into your business processes is a great way to save costs, increase revenues, and get noticed by your managers.