# Predictive Analyses with Excel

• ## Data Types

When you open and examine a data file, notice that not all factors (a.k.a. "variables") are the same. Some are numeric and continuous like income or age. Others are numeric, but ordinal like "commute distance" meaning that the numbers are broken into ordered "chunks" (e.g. 0-1, 1-2, 2-5, 5-10). Others are ordinal, but categorical rather than numeric like education. In order words, "high school" is greater than "partial high school." However, the data is text rather than numeric. Also, the difference between categories is not necessarily the same. For example, can you say for certain that the difference between "partial high school" and "high school" is the same as the difference between "high school" and "partial college" (answer: no)?

Lastly, some categorical data has no order whatsoever. We can divide these types into dichotomous categorical (e.g. "gender," "home owner," and "purchased bike") versus nominal , or those with more than two categories (e.g. "region" and "occupation"). One of the most common nominal factors that is not included in this particular data set is consumer ethnicity. As you'll find out below, not all analyses can be performed with every data type.

Some variables can be converted between numeric and categorical (e.g. dichotomous variables like "home owner" can be turned into 0 and 1 and ordinal variables like "education" can be turned into integers (e.g. 1 = partial high school, 2 = high school, 3 = partial college, 4 = bachelors, 5 = graduate). However, other variables, like region, cannot. Be aware of these data types as you perform data analyses.

• ## Bivariate Statistics

Bivariate statistics refers to the analyses we perform that measure the relationship between two variables. These bivariate relationships are relatively simple to visualize and analyze.

The first step to analyze this data is to visualize it. Two-dimensional visualizations, the most common and easiest to visually interpret, are those that show the relationship between two variables. If the two variables are continuous, then scatter plots are the preferred visualization.

• ## Scatter Plots

First, before adding the trendline, is there a clear pattern to the plots? Yes and no. You may have been able to decipher that there is a positive relationship between the two variables: meaning that the dots seem to "scatter" from the lower left to the upper right. That means that as one variable increases, so does the other one--to a degree. In other words, as people get older they make more money. However, you may not have been able to tell that without adding the trendline which clearly increases. Second, the scatter plot appears to be "grouped" into chunks. Why is that? It's because our data for income has been rounded to the nearest 10000. As a result, that variable is not truly continuous, it's ordinal. That's okay though.

Lastly, does this chart tell us that age causes income, or that income causes age? Although it is obvious that age leads to income (and not the other way around), the answer is neither. This is referred to as causal ambiguity. The visualizations and statistics don't tell us which factor causes which. Our theoretical explanation tells us causality, not the statistics.

• ## Slope

The slope is a measure of how much the y-value changes, on average, for each unit of increase on the x-axis. Slope is calculated as the rise over the run.

The slope of a line is a number that describes both the direction and the steepness of the line. Slope is often denoted by the letter m, such as in the equation y = mx + b. As shown below, the direction of a line is either increasing, decreasing, horizontal (no slope) or vertical (undefined slope).

The line of best fit is displayed on the scatter plot below. According to this equation, as height increases by one inch, self-esteem increases, on average, by 0.0707.

• ## Bar Charts

If one of the variables is not numeric, then a scatter plot is no longer useful. A bar chart will need to be used to sum up (or average, etc.) the values of one variable based on the categories of another.

• ## The Correlation Coefficient

If the relationship between variables was always perfectly linear, the equation y = mx + b, which includes the slope would be all that would be required to predict the value of one variable from the value of the other variable. Typical relationships in business and many other content domains are not perfectly linear. As in the example above, observations are scattered around the line of best fit. So we need an additional statistic.

The correlation coefficient is one of the most common and useful statistics. A correlation is a single number that describes the degree of relationship between two variables. The Pearson correlation, denoted by the symbol (r), describes the degree of linear relationships between two arrays of numbers

The formula for the correlation is shown below. Even though we have computers typically calculate correlation for us, the details of the calculation are included so you can see how correlation is calculated. Like all statistics, it helps to understand how they are calculated because it helps you know what the statistic actually means.

It's easier to understand correlation coefficients by examining the scatter plot chart. Correlations vary from -1 to 1. "Stronger" correlations are those which are closer to -1 and 1. Values closer to 0 are "weaker." In other words, a correlation coefficient of -0.5 is equally as strong as 0.5. The only difference is that -0.5 means that as one variable increases , the other decreases about 50% of the time. On the other hand, 0.5 means that as one variable increases , the other also increases about 50% of the time. That's not a perfect interpretation, but it's close enough. Take a look at the scatter plots and correlation coefficients below:

yThese first three are examples of perfect correlation. Notice that the actual slope of the line doesn't matter. It can be positive, negative, and not an integer. What's important is that, as x changes, the degree to which y changes by a predictable amount is the correlation. However, in real life, if you have two variables with a correlation of -1 or 1, then you are basically making a scatterplot of a variable against another version of itself (e.g. birth year and age). In other words, -1 and 1 are hypothetical limits. If we actually find a perfect correlation in practice, then we need to "throw out" one of them for future analysis. This is because there is a rule (see Regression later in this chapter), that you can't include variables in a model that are too highly correlated.

These next correlations are strong, but not perfect. However, you can interpret them roughly as "a unit change in x is 91 or 87 percent likely to result in a unit change in y." Or, vice versa. Also, remember that just because one variable is along the Y axis and the other is on the X doesn't imply that X is causing Y.

These correlations are weak relative to those in the scatter plots above. However, they may still be statistically significant. Other tests are required to say conclusively whether these relationships are due to chance or due to an actual--albeit weak--relationship between the two variables. However, notice that no matter how strong or weak a correlation coefficient is, you can still draw a line of best fit representing the slope.

This is an example of no correlation whatsoever where r = 0.0. Notice that each of these scatter plots also include another statistic (R 2 ). We will discuss this later.

Now it's time to calculate a correlation table in Excel. Watch and follow along with the video to learn how this is done.

If you just need to calculate the correlation between two variables and you don't need an entire matrix. If so, try the CORREL function in Excel. All you need to do is input two continuous columns of values. For example: `=CORREL(A1:A10, B1:B10)`.

• ## One-Way Analysis of Variance (ANOVA)

The Pearson Correlation Coefficient can only be calculated between pairs of numeric variables. Some categorical variables can be converted to numbers (e.g. education) and some cannot. For example, Education has an ordering to it: partial high school is less than high school, high school is less than partial college, etc. Therefore, you could convert education to numbers: 1 = partial high school, 2 = high school, 3 = partial college, 4 = bachelors, 5 = graduate, etc. However, that's not a perfect conversion. For example, is the progress required to move from partial high school to high school the same as the progress required to move fro bachelors to graduate? Maybe, maybe not. We don't know.

By converting education to numbers, we are assuming that it must be the same difference because each type of education increases by an integer of 1. Sometimes it's better to leave an ordered variable, like education, as a categorical variable with no ordering assumed among values. Other times, we have no choice. If you try to conver North America = 1, Europe = 2, abd Pacific = 3, then the statistical algorithm thinks you mean that Europe is 1 unit more "of a region" than North America (because 2-1 = 1) and the Pacific is 2 units more "of a region" than North America. That makes no sense. We need a formula that will treat all categorical values as equal, but different from each other.

A one-way Analysis of Variance (ANOVA) is a test that will not only measure the relationship between a categorical variable and a numeric variable, but also estimate whether that numeric variable is statistically different across categories. Follow along with this example:

ANOVA tells you whether there is a statistically significant difference among all of the categorical values. In the case above, different levels of education are statistically related to variations in income. In other words, more education means more money. You know this because the ANOVA produces an F-statistic with an associated p-value.

an F-statistic (labeled " F" in the Excel output) is a measure of the difference in some numerical variable (e.g. income) across all possible values of a categorical variable (e.g. education) that is due to actual differences between the categories and not to just random chance. Larger F-statistics indicate that the differences among categorical variable values are more likely due to true differences among categories.

F-statistics have associated p-values, which are the probability of finding that same--or even stronger--value assuming that there is no actual relationship between the variables. In this case, "value" refers to the F-statistic produced by the ANOVA. However, p-values can be generated for Pearson correlation coefficients and many other statistical values as well (e.g. see the regression coefficients you'll generate later in the chapter).

• ## Comparing Pairs of Groups: T-Test

ANOVA results do not tell you if one particular group is significantly different from another. One way to compare specific pairs of groups is by performing a t-test. As with the F-statistic produced by an ANOVA, the t-test also produces a p-value indicating whether the difference between a pair of groups is due to chance (p > 0.05; i.e., partial high school and high school are basically the same) or due to the nature of the difference between to categories (p < 0.05; i.e., a full high school degree actually makes you smarter--and, therefore, earn more money--than a partial high school degree). See the video example below:

Watch this video for more info on p-values (but only if you need to):

• ## Multivariate Statistics

The analyses we have performed so far only estimate the relationship among pairs of variables. [Side note: even though there are several categories of eduation, we were still only examining two variables: education and income] However, that is only a start. Variables don't exist "in a vacuum", or in isolation from other variables. There are typically MANY variables involved in explaining a phenomenon. In other words, we need to move beyond bivariate statistics to multi-variate statistics.

As we move beyond pairs of variables, it's also time to designate variables as dependent (y) versus independent (x). In particular, we want to designate one (and only one) variable as the dependent (a.k.a. "y") variable--which is the variable that we want to explain or predict. Next, we need to designate one (and up to many) variable(s) as the independent (a.k.a. "x") variable(s)--which are those variables which we will use to explain or predict the dependent variable.

For example, the dependent Y variable is typically something that is valuable to predict like whether or not someone will purchase one of our bikes (see the variable PurchaseBike: 0 = no, 1 = yes in the data set provided). There will only be one dependent variable in our examples. the independent X variable(s) are those which would theoretically predict the dependent variable. That is the entire purpose of all of the other variables in the bike buyers data set. For example, people with more income are more likely to purchase bikes. People who live closer to work are more likely to purchase bikes (for commuting). Rather than examine how one independent variable at a time relates to the dependent variable (e.g. a Pearson correlation coefficient), we want to know what the combined effect of all independent variables together is on the dependent variable. To accomplish this, we need to move beyong the Pearnson correlation coefficient (r) to the coefficient of determination (R2)

• ## The Coefficient of Determination

You may have noticed in the scatter plots above that we also calculated a statistic called R2 in addition to r. This is called the coefficient of determination which is a key output of fitting a line of best fit from a scatter plot. It is interpreted as the proportion of the variance in the dependent (y) variable that is predictable from the independent (x) variable(s). However, R2 can also be calculated between a Y variable and a set of X variables.

Notice that the types of analyses that take place at each level within each area. Sales and marketing use a TPS for order processing. The products purchased, their quantities, the customer, and a timestamp are all recorded in the order processing TPS. Marketing managers need to know if their new products are selling well, where they are selling, which marketing schemes are working, etc. So their MIS and DSS will summarize sales data by all of these factors and show them trends over time. Upper management needs to decide if they should even be in certain markets, so they will look at even more data over time and come up with long-term plans and forecasts. Notice that each other functional area follows a similar pattern of systems.

For example, let's conceptualize the correlation coefficient as the amount of variance in one variable that overlaps with another variable. However, we want to ignore whether the relationship is positive or negative. Rather, we simply want to know how much of a Y variable can be explained, or predicted, by an X variable. See the diagram below:

If we square the correlation coefficient r = .14 for that relationship (see the correlation table created earlier), we get an R2 value of 0.02. In other words, 2% of the variance in PurchaseBike can be explained by variance in Education which is represented (although a bit exaggerated) by the overlap between the two circles above. However, we have collected many variables which might overlap with, or explain, PurchaseBike. We have also included CommuteDistance in the figure below:

There are two things to learn from this image. First, R2 is a representation, not only of the effect of a single X variable on a Y variable, but also the total summed overlap of all X variables on a Y variable. Imagine adding a circle to that diagram for every factor we have measured in the Bike Buyers data set. The total overlap of all variables with PurchaseBike is the R 2 value we are interested in.

Second, notice that commute distance is correlated not only with PurchaseBike, but also with the other independent variable Education. In addition, part of the overlap between Education and PurchaseBike is also overlapped with Commute distance. So is the true relationship between Education and PurchaseBike best represented by the correlation coefficient between those two variables? No, it's better to analyze the effects of a set of X variables at once in order to see what individual effect each independent variable has on a dependent variable after removing, or "controlling for," the effects of all other variables.

In the figure below, the true effect of Education is represented by only the portion that doesn't overlap with all other independent variables. So how do we measure just that portion that is due only to education? That is one of the purposes of multiple regression.

• ## Multiple Linear Regression

Regression is a powerful statistics analysis that allows you to measure the relationship between a dependent (output) variable and, not just one, but a set of independent (input) variables. As a result, the effect of each independent variable is controlled for by the effects of the other independent variables.

In linear regression, data is modeled using linear predictor functions (think about drawing a straight, or "linear", line through the data). This allows unknown model parameters to be estimated from the data. As a result, multiple linear regression is a great first step toward predicting unknown future "Y" values based on a set of known existing "X" values. Using the previous data set, follow along with the video tutorial to see how a basic prediction calculator can be produced in Excel.

As you can probably tell if you followed along with the video above, multiple regression-based prediction calculators are somewhat complex, but also EXTREMELY powerful tools that are rarley used in practice by the "average employee" simply because they don't understand them or don't realize how easy they are to create in Excel. Based on our recent interviews with companies hiring our students, we highly recommend you spend the time to learn this technique and think through how you can apply it in other data sets.