If the relationship between variables were always perfectly linear, the equation y = mx + b, which includes the slope, would be all that is required to predict the value of one variable from the value of the other. Typical relationships in business and many other content domains are not perfectly linear. As in the example above, observations are scattered around the line of best fit, so we need an additional statistic that describes how closely the observations follow that line.

The **correlation coefficient** is one of the most common and useful statistics. A correlation is a single number that describes the degree of relationship between two variables. The Pearson correlation, denoted by the symbol r, describes the degree of linear relationship between two arrays of numbers.

The formula for the correlation is shown below. Even though computers typically calculate correlation for us, the details of the calculation are included so you can see how it is done. As with any statistic, understanding how it is calculated helps you understand what it actually means.
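For reference, one standard way to write the Pearson correlation for n paired observations is given here. The notation is an assumption for this sketch: x_i and y_i are the individual observations, and x̄ and ȳ are their means.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator measures how x and y vary together, and the denominator rescales that quantity by how much each variable varies on its own, which is why r always falls between -1 and 1.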

It's easier to understand correlation coefficients by examining scatter plots. Correlations vary from -1 to 1. "Stronger" correlations are those closer to -1 or 1. Values closer to 0 are "weaker." In other words, a correlation coefficient of -0.5 is exactly as strong as 0.5. The only difference is direction: -0.5 means that as one variable increases, the other tends to decrease, while 0.5 means that as one variable increases, the other tends to increase as well. Note that the coefficient is not a percentage or a probability, so 0.5 does not mean the variables move together 50% of the time. It measures how closely the points cluster around a straight line. Take a look at the scatter plots and correlation coefficients below:

These first three are examples of perfect correlation. Notice that the actual slope of the line doesn't matter: it can be positive or negative, and it need not be an integer. What matters is how predictably y changes as x changes, and that consistency is what the correlation measures. In real life, however, if two variables have a correlation of -1 or 1, you are basically making a scatter plot of a variable against another version of itself (e.g., birth year and age). In other words, -1 and 1 are hypothetical limits. If we actually find a perfect correlation in practice, we need to "throw out" one of the two variables for future analysis, because there is a rule (see Regression later in this chapter) that you can't include variables in a model that are too highly correlated with each other.
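To make the point about slope concrete, here is a minimal Python sketch that computes the Pearson correlation by hand for three perfectly linear data sets. The function name `pearson_r` and the slopes and intercepts are hypothetical values chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
steep = [3 * v + 2 for v in x]          # slope 3
shallow = [0.25 * v - 1 for v in x]     # slope 0.25 (not an integer)
falling = [-1.5 * v + 10 for v in x]    # slope -1.5 (negative)

for name, y in [("steep", steep), ("shallow", shallow), ("falling", falling)]:
    print(name, round(pearson_r(x, y), 6))
```

The two rising lines should both report a coefficient of 1 and the falling line a coefficient of -1, even though the three slopes are very different.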

These next correlations are strong, but not perfect. With coefficients of magnitude 0.91 or 0.87, a change in x is accompanied by a highly, though not perfectly, predictable change in y, and vice versa. These values are not probabilities, so they should not be read as "91 or 87 percent likely." Also, remember that placing one variable on the Y axis and the other on the X axis doesn't imply that X is causing Y.

These correlations are weak relative to those in the scatter plots above. However, they may still be statistically significant. Other tests are required to say conclusively whether these relationships are due to chance or due to an actual--albeit weak--relationship between the two variables. However, notice that no matter how strong or weak a correlation coefficient is, you can still draw a line of best fit representing the slope.
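As a rough illustration of that last point, the sketch below computes both the correlation and the least-squares line of best fit for a small, made-up data set. The data values and the function name `correlation_and_fit` are hypothetical, and the example says nothing about statistical significance, which needs a separate test.

```python
import math

def correlation_and_fit(xs, ys):
    """Return (r, slope, intercept) for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)          # strength of the linear relationship
    slope = sxy / sxx                       # least-squares slope (m in y = mx + b)
    intercept = mean_y - slope * mean_x     # least-squares intercept (b)
    return r, slope, intercept

# Made-up, widely scattered observations (illustration only)
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 1, 6, 2, 7, 3, 8, 4]

r, slope, intercept = correlation_and_fit(x, y)
print(f"r = {r:.2f}, line of best fit: y = {slope:.2f}x + {intercept:.2f}")
```

For this data the correlation is weak (about 0.29), yet the slope and intercept are still well defined. The line exists regardless of how tightly the points cluster around it; the correlation tells you how much to trust it.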

If you just need the correlation between two variables and don't need an entire matrix, try the CORREL function in Excel. All you need to do is supply two columns of continuous values with the same number of rows. For example: `=CORREL(A1:A10, B1:B10)`
