November 2004
Dimensions are Not Entities, or Why Data Structures Aren’t Equal
It takes some mental adjusting to adapt to data warehousing methods of data structure design. Both first-time data warehousers and experienced folks, who have kept their noses a little too close to that proverbial grindstone, can run into problems when attempting a query against a lone dimension table. These problems occur because a dimension is not the same thing as an entity. The previous statement is true even if the dimension is implemented as a table in a relational database. A dimension is simply not the same thing as a 3rd normal(-ish) form relational table. When an unsuspecting individual attempts to query a dimension table as if it were just another table, more often than not the querier will jump to the conclusion that something is wrong -- because the answers do not fit the expectations. A normalized table includes data that is functionally dependent on the key of the table. Under this dependency, data is fairly narrow in scope. Alternately, a dimension‘s scope could be fairly broad. Superficially, the design of a dimension table should, after a fashion, place everything into a functional dependence on the dimension‘s generated primary key. But while confusingly close, these two concepts (normalized functional dependence vs. generated dimensional dependence) are not the same thing. In the normalized table, included data functionally depends on the identity of an occurrence of the object described by that table. Therefore, contained data should be related closely to that single kind of object regardless of whether it is people, animals, or postage stamps. The dimension table is different; its primary key is jerry-rigged, specifically so that the key fits the needs of the data. Data items may be closely related to a single object; or a range of hierarchies in the ballpark of the object; or even varying groups of otherwise unrelated items that work more conveniently for some kind of query when mashed together. As an example, a "customer" dimension could contain direct customer-dependent data items; or it could contain loosely related data items under a broad category referred to as "customer," such as items that might be found in normalized tables for customer, corporation, customer locations, customer representative, etc.; or even sales hierarchy data and shoe size descriptions shoved together with some customer elements designated as a group named "customer." Clearly, the dimension has purposefully put that primary-key-cart right there in front of the dependency-horse, which you may have noticed is indeed a horse of a different color. On one hand, a normalized customer table would definitely lead one to believe that each row within the table represents a customer; on the other hand, the same cannot be said of a customer dimension. A dimension is designed so that valid answer sets are obtained through grouping or selecting values from inside the columns within the dimensions; there is no reliance on a value existing on a single row within the dimension. Due to the varying needs for managing the dimension content via slowly changing dimension logic, the number of rows found within the dimension table only indicates how often certain kinds of data change their values -- a trivial number, generally of limited interest. In order to emulate a normalized result against these dimensions, queries often must be formatted to return a count of distinct values within individual columns or column groupings, instead of performing the generic row counts usually executed. Therefore, people attempting to query a dimension as if it were no different than a normalized structure observe results that may be misleading. And in being misled, the user will look to the dimensional construction as being in error. All data structures are not created equal. Consider this the next time you need to write a query: look, and evaluate, before you leap. Dimensionalized designs are different, and that difference is not an accident. Multidimensional structures conform to a strict arrangement, albeit not a normalized one.
Todd Schraml is a contributing writer to DBTA and Data Architect at Innovative Health Strategies, a leading business intelligence services provider to pharmaceutical manufacturers through management and measurement of data related to contracting, marketing, and revenue generation strategies. He has many years experience in application development and maintenance; including over a decade focused on data warehousing, and several years experience in database administration on massively parallel processing database management systems. Across many projects and several industries, Todd has been Data Warehouse Architect, Technical Lead, DBA, Business Analyst, Developer, and Teacher. He lives in the vibrant Midwest and can be reached at [email protected]. |
2004 Database Elaborations Archives:
2003 Database Elaborations Archives:
2002 Database Elaborations Archives:
|