Duplicating measures as dimensions in data warehouse design

Published

10.3.2019

Products mentioned

Data Lakes + Warehouses

The definition of a measure and that of a dimension are basic premises of data warehouse design. You might have learned the difference in a class or from someone else who got you started doing some data modelling for a reporting project. Dimensions are the attributes in a design that serve as the “group by”, or row and column header, values on reports and other analysis aids produced from the warehouse. Measures are the numbers that get summarized at the intersection of one or more dimension values, for example in a grid style report at the intersections of the rows and columns.

Another rule of thumb for determining your dimensions and measures is by how “discrete” they are, i.e. how short their lists of possible values are. Dimension attributes typically have very discrete lists of values. For example, the “State” field in a database that contains only locations in the United States may have well short of 100 possible values including all US states and territories. This is a good candidate for a dimension. Measure attributes commonly have continuous (not discrete) sets of possible values. A good example of a measure is the total amount on a sale transaction. This value will likely be expressed in dollars and cents (or other currency). Depending on the industry we’re looking at, the number of possible values for total amount might easily number in the millions!

Like any rule of thumb, this one has its’ limits. You may come across an attribute that has exactly two possible values, zero and one, but that serves as a measure since you see it used entirely for sums, averages or other aggregations in analyses. What about those highly continuous attributes like sales totals — do we ever have a good reason to break our rule of thumb when dealing with them? The answer turns out to be yes, but you need to be careful. Asking some clarifying questions of your users can help ensure that you’re breaking the “discrete” rule for sound reasons.

Let’s suppose that you’re working with a business that lets its’ customers make high cost purchases using installment plans. In such a case the customer conducts a sale transaction that records two amounts: the total amount (which the customer has not actually paid) and a down payment amount. Installment plan sale transactions are mixed in with normal sale transactions where the customer pays the total amount up front. You learn from your users that they have two very different use cases for reporting on the installment plan amounts from sale transactions. The first use case is normal: they need to report the total amounts from the sale transactions, regardless of whether those are installment plan or “regular” sale transactions. To satisfy the first use case you are expecting to have a sale transaction “fact” table with several measures that express the transaction amounts.

The second use case is a little unusual: they need to be able to characterize customers who have opened installment plans in terms of their installment plan’s initial ratio of down payment to total amount. We’ll assume that each customer can have just one installment plan open at a time.

Perhaps your users haven’t figured out precisely how they want this ratio expressed but they know they need it. With this second use case in mind, you may choose to take the down payment and total amounts from the initial installment plan transaction for a customer and write them somewhere as dimensions, while also leaving them as measures on your sale transactions. A couple of questions need to be asked of your users: does it make sense to keep track of the information they need here in a customer context (i.e. is it single-valued at any given point in time for a customer), and does this information change infrequently or not at all? For our example, let’s suppose that the answers to both questions are yes. Then we may safely place an installment plan down payment amount and installment plan total amount on the customer dimension in our design. We may also want to write down installment plan initial transaction date alongside these for reference. If we determine that a customer may have multiple installment plans over time (by paying off one plan and subsequently opening another), then we may choose to implement slowly-changing dimension (SCD *) logic for these three attributes.

With the down payment and total amounts recorded for those customers who have installment plans, it’s now a simple matter to calculate a down payment ratio. However, one problem remains: the ratio doesn’t make a very good “group by” for analysis and reporting purposes because it is still a continuous value. A common solution to this problem is presenting a “banded” value for analysis and reporting instead of the straight ratio. Users should be consulted on how to present banded values, but a common paradigm has the values banded in 10-percent ranges, i.e. “1–10%”, “11–20%”, “21–30%”, “31+%” or something similar.

I hope this example has helped present a useful design option for some who will read it!

*SCD attribute handling is a bit too much to cover here, but plenty of solid write-ups of the technique can be had with a simple search online.

Ajit is an AWS Certified Solutions Architect and a member of the One Six Solutions team.