Data Structure

  • The Lifeblood of an Organization

    You've probably heard the common analogy that information is the "lifeblood of an organization." What does this mean?

    1. data does not sit still in an organization -- it's constantly flowing. Every single business process that occurs as part of the organization creates data. This new data is stored and shared among the other business processes that require it. Other data is old, outdated, and therefore deleted. In other words, like blood, data is constantly flowing to the parts of the organization that need it.
    2. data is necessary for the organization to live. If there is not a core set of data that is readily available when needed, then "business" can't get done.
    3. Data must regularly be "cleaned." To the greatest degree possible, it must be "pure" before it enters the database. As with poisoned, diseased, or impure blood, bad data will kill or substantially harm an organization.
    4. Data can be used for two high-level purposes: (a) sustaining business processes, and (b) creating new strategic advantages. We characterize these two purposes as data used to produce either transactional information or analytical information.
  • Information Types

    Transactional information is data that is generated by real-time business processes. "Real-time" means that the data is recorded as the transaction occurs, like ATM transactions, customer checkouts, or airline reservations.

    Transactional data are stored in what we call an operational database. Operational databases are those required for everyday business processes or operations. Therefore, it is critical that operational databases remain running and available at full speed. Otherwise, the business shuts down. Speed and availability are the key performance indicators of an operational database storing transactional data.

    Analytical information refers to any data or information that can be used to make intelligent inferences relevant to the organization, help solve unstructured business problems, and support higher-level, ad-hoc decision making. While transactional information simply helps you perform your business processes, analytical information is used to help you gain strategic competitive advantage in the marketplace.

    Analytical information is not stored in operational databases. Remember that speed and availability are the key characteristics of operational databases. If managers want to analyze the data in their operational database, they will not mess with the operational database and potentially slow down or crash the business. Therefore, they will copy all of the transactional data from the operational database, transform it into a summarized form, and load it into a data warehouse where managers can analyze the data to their heart's content.

  • Information Granularity

    Granularity is defined as the scale or level of detail present in a set of data. Information granularity is a spectrum from Coarse to Fine. Fine also known as ("Detail") granularity means that the information is in raw data form (e.g. name=Jane, age=30). Summary takes it up one level by summarizing the transactional information into analytical information (e.g. Jane's weekly purchases total $256.78). Coarse (also known as "Aggregate") granularity means that the data has been summarized even further ( e.g. the entire company's weekly sales = $2,567,400.12).

    Functional Areas
  • Information Quality

    Just like poor blood leads to poor health, poor-quality data leads to poor company health. There are five characteristics of high-quality information for which you must strive: accuracy, completeness, consistency, uniqueness, and timeliness.

    1. Accuracy - Is the information correct? Is it spelled correctly? How do you make sure? Programmers and database developers can create controls that can help in this regard. For example, we can make you select a state from a dropdown list of available options instead of having you type the state yourself. Any data input that has a known option set can be validated this way.
    2. Completeness - Is all or part of a value missing? If you ask for a phone number, make sure there are 10 digits. If you ask for an address, make sure it includes the zip code.
    3. Uniqueness - Is the exact same piece of data or information stored multiple times unnecessarily? This does not refer to the situation where a customer name is stored both in an operational database (as transactional information) as well as a data warehouse (as analytical information). It means that we've created multiple customer accounts for the same customer in the same "customers" database, or the same order is "rung up" twice by accident.
    4. Consistency - Is the total of a customer's weekly purchases (stored in an analytical database) equal to the sum of the items on the receipts (stored in the operational database)? Consistency refers to our own poor programming or poor database administration. What if, when programmers write the code to summarize a customer's weekly sales, they perform the algorithm incorrectly? In that case, there is an inconsistency between the operational database and the data warehouse. This is also relatively common and can absolutely ruin data analysis.
    5. Timeliness - Is the difference between the current time and the time when my data was last updated as small as reasonably possible? Although newer data is always preferred, more frequent updates means higher costs.

    You should not cut corners when it comes to information quality, but don't forget the economic principle of satisficing established by management theorist Herbert Simon (1956). In decision making, the principle of satisficing means that it is not always optimal to make a "perfectly informed" decision because there is a cost of obtaining perfect information. At some point, the cost of greater information quality exceeds the value of that quality. So, what should you do? You should make sure you know exactly how much quality your information has. When using it to make predictions, make sure you account for the level of quality.

  • Relational Databases

    A Database is a collection of data organized for search and retrieval. So, what can be a database? Based on this definition, a simple piece of paper with a list of items could be a database if it allows data sharing and retrieval. The term database management systems (DBMS) refers to the software system used to store data electronically and provide advanced features for ensuring and maintaining data quality as well as managing the security and availability of the data.

    So, how can a DBMS help keep your data quality high? This is the role of a relational database. Relational databases solve the problem of data redundancy. Redundancy means that the same data is recorded multiple times. Not only is this unnecessary, it creates opportunities for inconsistency. To understand the problem of redundancy and how it is solved with relational databases, watch the following video. It introduces the concept of primary and foreign keys.

    This next video goes deeper into how primary keys and foreign keys are used to divide up data in an order processing example.

    DBMSs offer a variety of features designed to help ensure high quality of data:

    1. they eliminate redundancy by allowing separate related tables
    2. they provide security features
    3. they allow data to be shared among MANY users at the same time
    4. they offer a variety of business-critical integrity constraints

    Integrity Constraints are features of database fields which allow only "clean" data to be entered. For example, you can require that zip codes are always 5 digits long (length=5, data type=numeric). You can make sure that salaries are positive (value >=0). You can be sure that dates are only selected from date pickers so that they are formatted correctly, and more.

  • Relational Databases in MS Access

    You may have already had experience creating tables in MS Access. But let's make sure you remember how to ensure the quality of data inputs and also create the relationship among tables.