Introduction to the Business Intelligence (BI) Stack

  • Cause and Effect

    What causes one business to be successful and another to fail? This is the question that drives the field of business management. Business management (and particularly information systems) is actually a very creative discipline. There are no rules other than "do not break the law." Therefore, organizations have very wide parameters within which they can develop creative ways of achieving "above-average" gains in a market. There are a few ways to come up with good ideas:

    1. Take a completely random guess and hope that it works
    2. Draw on your past experience (and that of your co-workers and employees) as a basis for new ideas
    3. Gather data about your performance and use it to determine cause and effect. This is what this section will discuss.

    The "effect" that we are interested in is business success. That's obvious (even though success can be broken down into many, many specific and measurable outcomes). What's less obvious is the "cause" of success. This is where you have to be careful. Data allows us to measure hypothesized causes and desired successes. However, it cannot determine the true cause of each effect. It only gives you "support" for a theorized cause/effect relationship. Consider for example this chart depicting accurate data:

    [Image: chart correlating organic food sales with autism diagnoses]

    Does organic food cause autism? Probably not. In fact, this ridiculous chart was made as an example of how badly data can be misinterpreted if you don't know how to use it. Therefore, the purpose of this chapter is to give you a high-level overview of the various ways in which we use and analyze data in an organization.

  • Business Intelligence and Analytics

    Analyzing data to improve business decision making is a big field with many terms. So, let's define some of the common terms that are often used interchangeably but have semantic differences. Business intelligence (BI) is a term with a broad meaning that generally refers to the process and technology tools used to transform raw data into meaningful and useful information that supports business decision making. More recently, the term "BI" is used to refer more specifically to the portion of the process that describes and reports on the past and existing state of the business, whereas the term business analytics, a subset of BI, is often used to refer to the statistical analyses performed to help predict the future.

    However, these terms are often used interchangeably. In addition, each term is often used to refer to all data analytics purposes, techniques, and tools rather than a single subset, thus making things even more complicated. The reason for the frequent ambiguity between these terms is that this area is changing faster than perhaps any other type of organizational technology. Recent estimates place BI projects as being 40-150 percent more profitable than any other type of IT project. As a result, the rapid growth and technology revolution in BI is likely to continue for the foreseeable future.

    BI Stack

    Let's begin by outlining the entire BI stack, which is the set of technologies (hardware and software) that either support or directly offer data description and analytics capabilities.

    [Image: conceptual components of the BI stack]

    The image above gives an overview of each conceptual component of the BI stack while the image below (from http://hortonworks.com/open-enterprise-hadoop/) summarizes several of the major product vendors for each component. While you don't need to memorize these brands, it would be useful to familiarize yourself with them (particularly the applications on top) because you are the primary user of those tools.

  • Data Sources

    Data sources are "where data comes from." Clearly, much of our data will come from our operational databases which were the focus of the prior module. Operational databases keep track of every piece of data necessary for our business processes in a non-redundant (a.k.a. "normalized") relational database. However, there are a lot of other sources of good data that can be used to create business intelligence to aid decision making.

    Another important source of data is the research we hire consulting firms to perform or that we perform ourselves. For example, we may administer a survey on our company website. Or, we may hire a firm to call our customers and ask them about their satisfaction with our products. Similarly, we likely have supply chain partners who are incentivized to see us succeed, so they will often offer us access to some of their data, which may be useful.

    Next, it is very common to purchase access to third-party databases. Consider the major credit bureaus (Equifax, TransUnion, and Experian). They offer consumer credit information for sale. The United States Postal Service sells access to address information. A company can use these data sources to "fill in the gaps" in its own consumer data collection.

  • Extract, Transform, Load (ETL)

    It's a bad idea to perform complex analyses on the operational database. Operational DBs need to be fast so that the business can run quickly. They also need to be accurate. As a result, you can't have people querying an operational database for ad-hoc or "non-core business process" reasons and possibly messing up the data. Therefore, we copy the data from the operational databases. This is called the extract/transform/load (ETL) process.

    Extract

    Extract (or export) means that the data is copied from the operational database(s) and other data sources.

    Transform

    Transform means that we do not want to load the data "as-is" into another database. We don't need the same level of detail for our analyses. Therefore, we will typically summarize it in some way. For example, rather than store each individual sale, we will store the daily sales by product line, salesperson, store location, etc.
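
    To make this concrete, here is a minimal sketch of the "transform" step using Python's pandas library. The column names (sale_date, product_line, store, amount) are illustrative assumptions rather than a real schema, and an actual ETL job would read the extracted rows from the operational database instead of building them by hand.

      # A minimal "transform" sketch: summarize individual sales into daily totals.
      # Column names are illustrative assumptions, not a real schema.
      import pandas as pd

      # Pretend these rows were just extracted from the operational database.
      sales = pd.DataFrame({
          "sale_date":    ["2024-03-01", "2024-03-01", "2024-03-01", "2024-03-02"],
          "product_line": ["Shoes", "Shoes", "Hats", "Shoes"],
          "store":        ["Downtown", "Mall", "Mall", "Downtown"],
          "amount":       [59.99, 89.50, 19.99, 45.00],
      })

      # Transform: roll individual sales up to daily totals per product line and store,
      # which is the level of detail the data warehouse actually needs.
      daily_summary = (
          sales.groupby(["sale_date", "product_line", "store"], as_index=False)["amount"]
               .sum()
               .rename(columns={"amount": "daily_sales"})
      )
      print(daily_summary)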

    The "transform" step in ETL also includes significant data cleaning . Data cleaning is the process of improving the quality of the data. Think back to our discussion about the characteristics of high quality information.

    For example, one important concern with ETL is how often it should occur. Should you ETL every single sale as it happens in real-time? Probably not, because that would require significant extra computing power. So, should you ETL once a month? Well, it depends. Is the value of having more up-to-date data greater than the cost of running the ETL batch more frequently? If so, then you should probably run it more often than once a month. In other words, that decision affects the timeliness of your data.

    Can you design your ETL process to query the United States Postal Service database and fill in any missing data (to be added into your data warehouse, not your operational customer DB)? Yes, you can. That would improve the missing values problem and possibly any accuracy problems with customers' contact information. In fact, there are many other databases you can pay to access and program into your ETL process to improve or clean your data.
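
    As a rough illustration of that kind of gap-filling during the transform step, the sketch below merges a hypothetical purchased address table into extracted customer records and uses it to repair missing values; every table and column name here is invented for the example.

      # A sketch of enriching extracted records with purchased third-party data.
      # All names (customers, third_party_addresses, zip_code) are hypothetical.
      import pandas as pd

      customers = pd.DataFrame({
          "customer_id": [101, 102, 103],
          "zip_code":    ["84604", None, None],   # missing values to repair
      })

      third_party_addresses = pd.DataFrame({       # hypothetical licensed data
          "customer_id": [102, 103],
          "zip_code":    ["90210", "10001"],
      })

      # Merge the purchased data in, then prefer our own value whenever we have one.
      merged = customers.merge(third_party_addresses, on="customer_id",
                               how="left", suffixes=("", "_purchased"))
      merged["zip_code"] = merged["zip_code"].fillna(merged["zip_code_purchased"])
      cleaned = merged.drop(columns=["zip_code_purchased"])
      print(cleaned)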

    Load

    Lastly, load simply means that the data is then copied into a new database, usually a data warehouse.

  • Data Warehouses and Marts

    Data Warehouses

    The data warehouse (a.k.a. analytical database) is where we load the recently transformed data. Data warehouses are also relational databases of a sort. However, the same rules of normalization (non-redundancy of data) no longer apply. Data warehouses have their own rules and schemas (e.g., the "star" schema) that are optimized for ad-hoc data analyses.
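
    To give a feel for what a star schema looks like, here is a minimal sketch using SQLite from Python: one central fact table of summarized measures surrounded by the dimension tables it refers to. The table and column names are illustrative only, not a prescribed design.

      # A minimal star-schema sketch: a fact table surrounded by dimension tables.
      # Table and column names are illustrative only.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.executescript("""
      CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
      CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_line TEXT, product_name TEXT);
      CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, store_name TEXT, region TEXT);

      -- The fact table holds summarized measures plus foreign keys to each dimension.
      CREATE TABLE fact_sales (
          date_id     INTEGER REFERENCES dim_date(date_id),
          product_id  INTEGER REFERENCES dim_product(product_id),
          store_id    INTEGER REFERENCES dim_store(store_id),
          daily_sales REAL,
          units_sold  INTEGER
      );
      """)
      # Analysts can now "slice" sales by any combination of date, product, and store
      # without ever touching the operational database.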

    In the image above depicting common vendors, the middle layer is labeled "data systems" because many of today's most popular technologies for data warehousing include a data integration and staging layer in between the ETL and data warehousing layer. Hadoop is currently the most popular technology for integrating data from a wide range of sources into a data warehouse.

    Many industry experts believe the days of the data warehouse are numbered. Hardware has progressed to the point where the entire data warehouse can exist entirely in memory (think massive amounts of RAM). This means that there is no separate, disk-based copy of the data warehouse. The ETL process simply loads the analytical data into the memory of servers that are "always on." Essentially, this means that the data is simply being pushed from slow secondary storage (hard drives) to fast primary storage (RAM). On one hand, this is more efficient because it doesn't require a second database with its own copy of the data. On the other hand, it's riskier because servers can crash and need rebooting (thus erasing the data held in RAM and requiring that the entire analytical database be re-created). However, that may be an acceptable risk since analytical data is not necessary for running core business processes.

    Data Marts

    Data marts give limited, role-based access to certain portions of the data warehouse. Data marts exist for two reasons. First, they limit information security vulnerability (i.e., not everyone sees, or has access to, all the data). Second, they reduce the complexity of the data for end users because they don't see anything they don't need.

    Although data marts do serve those two purposes, they may be more of an artifact of legacy systems than an intentional architecture. What does that mean? Well, data warehouses initially formed because individual organizational silos (marketing, accounting, HR, etc.) began developing their own mini analytical databases to aid their decision making. As organizations recognized the value in these mini analytical databases, they realized that the data storage should be centralized. As a result, in many cases, data warehouses were a "backtracked" technology designed to share the analytical data being stored by each organizational department.

  • Descriptive Analysis

    Descriptive data mining refers to the set of tools and procedures designed to analyze data in ways that describe the past and immediate present state of the business processes that produce the data.

    Descriptive tools and techniques comprise most of those that people refer to when they speak of "data mining" or "BI". The idea here is to take large amounts of data (hence the term "big data") and summarize it into either a set of key performance indicators (KPIs) or ad-hoc measures.

    Key performance indicators are pre-planned measures that have been carefully determined to indicate the organization's performance on a particular business process. The reason for the descriptor "key" is that these performance indicators have been determined to be those that are most crucial to a business's success. Because KPIs are so important, management typically wants to see them in real-time and have them available for review at any time. Therefore, KPIs are summarized into a user interface known as a dashboard. Good dashboards (i.e., more expensive dashboards) also give managers tools for exploring the KPIs in more detail and breaking them down into more specific measures. See the dashboard example below and try to determine the KPI(s) it is depicting.

    [Image: example e-commerce dashboard]

    The dashboard above (found at http://www.brainsins.com/en/blog/dashboard-for-e-commerce/2147) is likely used by a sales manager who is tracking the performance of their product line. There are several key components. First, notice the most important KPI to a sales manager: revenue. Revenue is even depicted in an "automobile-like" dashboard form (which is where the term "dashboard" came from) in that it looks like a speedometer. Second, notice the OLAP cube. OLAP stands for "online analytical processing," and "OLAP cube" is a common term used to describe the data table in the lower-left corner of the image above. Typically, the user can control the cube and change the dimensions that the data is summarized by. You'll learn to do something very similar (but on a much smaller scale) when you create Excel PivotTables in the next chapter. Third, notice that there is a prediction of future sales in the upper-right corner. Although we are currently talking about descriptive tools and analyses, this chart of future sales is an example of a more advanced type of predictive analysis that will be discussed in the next section. However, as you can see, dashboards can include both descriptive and limited predictive elements. Yet, their primary purpose is to focus on KPIs like the revenue visualization in the top center of the image above.
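
    If you want a small-scale feel for what the OLAP cube in that dashboard is doing, the sketch below uses a pandas pivot table to re-summarize the same (invented) sales data along different dimensions, much like the Excel PivotTables you will build in the next chapter.

      # A small-scale illustration of the OLAP-cube idea: re-summarizing the same
      # data along different dimensions. The data and column names are invented.
      import pandas as pd

      sales = pd.DataFrame({
          "month":        ["Jan", "Jan", "Feb", "Feb", "Feb"],
          "product_line": ["Shoes", "Hats", "Shoes", "Hats", "Shoes"],
          "region":       ["West", "West", "East", "West", "East"],
          "revenue":      [1200, 300, 950, 410, 780],
      })

      # Revenue by product line and month (one "slice" of the cube)...
      print(pd.pivot_table(sales, values="revenue", index="product_line",
                           columns="month", aggfunc="sum"))

      # ...and the same data re-summarized by region instead.
      print(pd.pivot_table(sales, values="revenue", index="region",
                           columns="month", aggfunc="sum"))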

    Often, managers will face new problems or strategy sessions where they need to create new data analyses and ad-hoc measures (those created "on the fly" to help in decision making; typically associated with an ad-hoc database query). This is a primary reason that Microsoft Excel is such a mainstream tool at every level in an organization. It has many features for data cleaning and analysis on the fly. Excel provides a nice platform for downloading relatively small amounts of data and creating (or experimenting with) new measures. Often, KPIs are created in Excel before they become true KPIs and are shifted into a more permanent online, web-based dashboard. Later, you will learn several very useful Excel features for data analysis: PivotTables (descriptive), Solver (predictive), and the Analysis ToolPak (both descriptive and predictive features).

  • Predictive Analysis

    As it turns out, it's not all that difficult to calculate measures of our past performance. However, what if we could predict the future? Then, we could outsmart the competition by making the products that consumers really want (perhaps before they know they want them), recognizing viruses before they are a "known" virus, knowing if a consumer is going to buy our product before we give them the sales pitch, and much, much more. This is the purpose of predictive tools and techniques.

    Predictive data analysis requires relatively complex statistical formulas that use historical data to make predictions about the relationships between sets of variables. There are many forms of, and formulas for, these analyses. We'll review the main ones here.

    Detecting Categories

    Detecting categories is the process of "clustering" records (remember records = rows = instances) in a database into groups of related records. For example, consider the number set: 1,1,1,1,2,2,5,5,5,5,6,6,6,6,9,9,9,9,10. How many clusters of related numbers are there? Hopefully you said three. Let's assume those numbers refer to the number of products that each of a set of customers purchased during their last visit to our website. What if we have more data on each customer like the number of days since they made those purchases? That would require us to plot the values of those two variables for each customer like the image below:

    [Image: customer clusters plotted across two variables]

    What if we have three variables for each customer? See image below:

    [Image: customer clusters plotted across three variables]

    What if we have 35 variables for each customer? This is quite possible. However, there's no way to visualize clusters based on 35 dimensions. Yet, statistical algorithms can conceptualize 35 dimensions in your computer's memory and summarize your customers into as few as two (or as many as needed) basic clusters. Clustering analyses will not only tell you how many clusters were found, but also the primary characteristics (attribute:value pairs) of each cluster. This will allow you to group your customers into segments and create unique strategies for each segment.
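
    As a hedged sketch of how category detection looks in practice, the example below clusters a handful of invented customer records on just two attributes using k-means from scikit-learn; a real analysis would use many more attributes and records, and the choice of three clusters is an assumption.

      # A k-means clustering sketch (scikit-learn). Data and cluster count are invented.
      import numpy as np
      from sklearn.cluster import KMeans

      # Each row is a customer: [products purchased last visit, days since last visit]
      customers = np.array([
          [1, 30], [1, 28], [2, 25], [2, 27],
          [5, 10], [6, 12], [5, 11], [6,  9],
          [9,  2], [9,  3], [10, 1],
      ])

      model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)

      print(model.labels_)           # which cluster each customer was assigned to
      print(model.cluster_centers_)  # the "typical" customer in each cluster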

    Analyzing Key Influencers

    Key influencer analysis is one of the most common techniques and involves measuring the correlation (i.e., the predictive ability) between a set of independent (a.k.a. "x") variables and, typically, one dependent (a.k.a. "y") variable. This is a great way to find out, for example, what characteristics of potential customers are related to their level of repeat purchases in your store. For example, as customers have more education, they may be more likely to make a purchase (represented by the line of best fit through the scatterplot below). However, it is very important to be wary of assuming causality even though you may find statistically significant results. For example, it may not be a customer's education that causes them to make purchases; rather, their education led to greater income, which led to more purchases.

    There are several statistical formulas used to make these kinds of predictions. The image below depicts a regression analysis. Other formulas include Naive Bayes, Decision Trees, Neural Networks, and more.

    [Image: scatterplot with a regression line of best fit]
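
    Here is a minimal sketch of a key influencer analysis using a linear regression from scikit-learn. The customer attributes and values are invented, and, as noted above, a strong coefficient still does not prove causation.

      # A key-influencer sketch: regress repeat purchases on customer attributes.
      # The data and attribute choices are invented for illustration.
      import numpy as np
      from sklearn.linear_model import LinearRegression

      # Columns: years of education, annual income (in $1,000s)
      X = np.array([[12, 35], [14, 42], [16, 60], [16, 75], [18, 90], [20, 110]])
      y = np.array([1, 2, 3, 4, 5, 7])   # repeat purchases per year

      model = LinearRegression().fit(X, y)
      print(model.coef_)                 # how strongly each attribute relates to purchases
      print(model.predict([[16, 65]]))   # predicted purchases for a new customer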

    Forecasting

    Forecasting is the process of predicting future values over interval time periods based on known, measured values of the same interval periods. As a result, forecasting always has a standard time period and is charted over time. Sales revenues, profit, costs, and market demand are among the most common measures forecasted over time (i.e., time-series). The ARMA (autoregressive moving average) and ARIMA (autoregressive integrated moving average) formulas are among the most common statistical formulas used for forecasting.

    [Image: time-series forecast chart]
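
    The sketch below fits an ARIMA model to a short, invented monthly revenue series using the statsmodels library and forecasts the next three periods; a real forecast would rest on far more history and a carefully chosen model order.

      # A time-series forecasting sketch with ARIMA (statsmodels). Data are invented.
      import pandas as pd
      from statsmodels.tsa.arima.model import ARIMA

      monthly_revenue = pd.Series(
          [120, 132, 128, 141, 150, 149, 158, 166, 172, 180, 178, 190],
          index=pd.period_range("2023-01", periods=12, freq="M"),
      )

      # order=(p, d, q): autoregressive terms, differencing, and moving-average terms.
      model = ARIMA(monthly_revenue, order=(1, 1, 1)).fit()
      print(model.forecast(steps=3))   # predicted revenue for the next three months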

    Market Basket Analysis

    Market basket analysis is a popular analysis for predicting consumer shopping patterns. In particular, it involves examining the products that have been grouped in the past by consumers (e.g., in a "shopping basket") and using that information to predict related items that each customer may want to purchase based on the shopping baskets of other customers who bought similar products (see image below). The statistical technique used to perform market basket analysis is called "association analysis." If you've ever visited amazon.com, then you've seen market basket analysis as new products are always suggested based on the product you are viewing.
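
    To show the intuition behind association analysis without any special library, the sketch below counts how often pairs of products appear together in the same (invented) baskets and computes a simple "customers who bought A also bought B" confidence.

      # A bare-bones association-analysis sketch: count product pairs per basket,
      # then compute confidence. The baskets are invented.
      from itertools import combinations
      from collections import Counter

      baskets = [
          {"bread", "milk"},
          {"bread", "diapers", "beer"},
          {"milk", "diapers", "beer"},
          {"bread", "milk", "diapers"},
          {"bread", "milk", "beer"},
      ]

      item_counts = Counter()
      pair_counts = Counter()
      for basket in baskets:
          item_counts.update(basket)
          pair_counts.update(combinations(sorted(basket), 2))

      # Confidence that customers who bought A also bought B = count(A and B) / count(A)
      for (a, b), together in pair_counts.items():
          print(f"{a} -> {b}: confidence {together / item_counts[a]:.2f}")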

    Other Tools

    The statistical formulas used in the analyses above can also be used to improve other important steps in data analysis. For example, if you have missing consumer data in your analytical database, many statistical formulas will automatically ignore all of that consumer's data because they require complete data to work at all. Let's say that only 8 percent of your customers have completed their entire online profile. It would be very sad to have to ignore the other 92 percent of your customers. So what options do you have? Well, you can start paying for external databases which might be able to fill in the gaps. However, those databases often have the exact same info you have. Another option is to fill in the missing values of each customer with the average of all other customers. While this will allow you to use more of your data, it will likely reduce the strength of your relationships. More recently, a popular technique has been to use the same statistical analysis used in key influencer analysis (e.g., regression) to predict the most likely value of the missing data based on the actual values of all other attributes of the record. For example, if you know that a customer is male, age 20, not a home owner, and works part time, your statistical regression model is likely to also predict that this person has a partial college education. While it is definitely possible that this prediction is wrong, it is much more likely to be accurate than using simply the most common value for education found in the data.
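
    Here is a minimal sketch of that regression-based imputation idea, with invented attribute codings: train a regression on customers whose education is known, then predict the missing value for a customer from their other attributes.

      # Imputing a missing value with a regression prediction instead of the average.
      # Attribute codings and data are invented for illustration.
      import numpy as np
      from sklearn.linear_model import LinearRegression

      # Columns: age, home_owner (0/1), works_full_time (0/1); target: years of education
      complete = np.array([
          [20, 0, 0, 13], [35, 1, 1, 16], [48, 1, 1, 18],
          [22, 0, 1, 14], [41, 1, 1, 16], [19, 0, 0, 12],
      ])
      X, y = complete[:, :3], complete[:, 3]

      model = LinearRegression().fit(X, y)

      # A customer whose education is missing: age 20, not a home owner, works part time
      predicted_education = model.predict([[20, 0, 0]])[0]
      print(round(predicted_education, 1))   # a more tailored guess than the global average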

    Another useful technique is to use the clustering algorithms found in category detection tools to identify records that are outliers. For example, your customers may have an average income of $75,000 per year. Therefore, a customer making $200,000 per year may appear to be an outlier. Removing outliers from the data is a great way to improve your predictive power. However, a clustering algorithm would examine all of the other attributes of this seeming outlier in concert and find that because that customer has a graduate degree and 30 years of full-time work experience, they are well within the normal range of customers. Similarly, another customer who earns the exact average of $75,000 would be identified as an outlier if they are 16 years old.
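
    As a toy illustration of that clustering-aware screening, the sketch below assigns invented customers to k-means clusters and flags the ones sitting farthest from their cluster center, rather than judging any single attribute in isolation.

      # Clustering-aware outlier screening: measure each customer's distance from
      # its assigned cluster center. Data and cluster count are invented.
      import numpy as np
      from sklearn.cluster import KMeans

      # Columns: income ($1,000s), age, years of full-time work experience
      customers = np.array([
          [75, 40, 18], [72, 38, 15], [78, 45, 22], [70, 35, 12],
          [200, 55, 30],   # high income, but consistent with age and experience
          [75, 16, 0],     # "average" income, but unusual for a 16-year-old
      ])

      model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
      distances = np.linalg.norm(customers - model.cluster_centers_[model.labels_], axis=1)

      # The largest distances are the records worth reviewing as possible outliers.
      print(distances.round(1))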

    In summary, statistical formulas and techniques have become mainstream in today's BI stack. Creating new and useful ways to integrate statistical prediction into your business processes is a great way to save costs, increase revenues, and get noticed by your managers.

  • Characteristics of Good Measures

    In your company, you will almost certainly have many opportunities to explore data, clean data, and create your own unique measures that will help you and your managers make important decisions. In fact, creating new and useful measures of performance is a great way to distinguish yourself from your peers and demonstrate your intellect and usefulness to those you report to. Therefore, it is important that you learn the conceptual characteristics of good measures.

    Simple

    Measures should be easy to understand. The value of measures lies in helping managers track outcomes and performance. Thus, straightforward and understandable measures are better than complex or convoluted ones. A more specific definition of "simple" is that few inputs are required to produce the same measure that could be produced with more inputs. Of course, just because a measure is simple does not mean that it is the right measure if it doesn't provide the meaning you are looking for. However, all other things being equal (ceteris paribus), a simpler measure is better.

    Easily Obtainable

    An easily obtainable measure includes two parts:

    1. Ease of Collection - How easy are the measures to collect? Does it require human counting, or is it generated through automation?
    2. Ease of Calculation - How easily can you calculate useful information from the source data? Counts and simple ratios are easy to calculate, while measures that take several variables into account are more complex. It is often beneficial to automate the collection and calculation tasks when possible; otherwise, measurement is unnecessarily arduous and costly, prone to error, and less likely to be done. This is especially true of measurements that are computation-intensive or that require an array of data to calculate.

    Precisely Definable

    Measures should be clearly defined so that they can be applied and evaluated consistently. Organizations need to establish and adhere to specific rules when collecting measures and ensure that these rules are being followed uniformly so that the integrity of the data is maintained.

    Objective

    Ideally, two or more qualified observers should arrive at the same value for the measure. Objective measurements will be referred to later when we discuss the need for administration of measurements.

    Valid

    The measure should actually reflect the property it is intended to. Often measures are designed to reflect specific process improvement constructs such as quality and efficiency.

    Robust

    Measures should be insensitive to insignificant changes in the process or product.

    Pulling It All Together

    It is unlikely that the "best" measures will incorporate each of these characteristics to the highest degree possible. Rather, you will likely need to make trade-offs among these traits. As a result, it is common to have multiple quantitative measures (i.e., represented on a dashboard) to depict a concept.

    Consider the following two measures: (1) consumer purchase volume and (2) consumer satisfaction. Purchase volume is fairly simple; define the time frame and add up the purchases. It's easy to collect because you can simply query the operational database. It's precisely definable as long as you have the time frame. It's objective because it's based on actual measurable behavior and not opinions. It's valid as long as you are recording purchases properly in your databases. It's also robust because if you change the time frame, you can still count on it being accurate.
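
    As a tiny illustration, purchase volume really is just a time-framed sum over the operational data; the table and column names below are invented.

      # Purchase volume for a given time frame: pick the dates and sum the purchases.
      # Table and column names are invented for the example.
      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.executescript("""
      CREATE TABLE purchases (customer_id INTEGER, purchase_date TEXT, amount REAL);
      INSERT INTO purchases VALUES (1, '2024-03-05', 59.99), (1, '2024-04-10', 20.00),
                                   (2, '2024-03-20', 15.50);
      """)

      march_volume = conn.execute(
          "SELECT SUM(amount) FROM purchases WHERE purchase_date BETWEEN ? AND ?",
          ("2024-03-01", "2024-03-31"),
      ).fetchone()[0]
      print(march_volume)   # total purchase volume for the March time frame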

    Now consider satisfaction. Is there a perfect measure that precisely defines exactly what satisfaction means to everyone? Not likely. Is purchase volume a perfect indicator of satisfaction? Actually, no. What if someone buys your product because they "have to"? Does everyone interpret the number "2" on a 1-5 scale to mean the same thing? Also, survey responses are MUCH more costly and difficult to collect than purchase volume queried from a database. Therefore, a smart manager will collect several measures that she or he believes represent what a customer's satisfaction truly is.