Conceptual Design

View on GitHub

Steps to create a Data Mart

Dimensional Model

Components

FACT table: Primary table where numeric performance measures for a business process are stored
- Composite PK from many FKs
- Facts/measures: A business measure (numeric, additive)
Dimensional tables
- contains textual description of business
- many, many, many dimensional attributes
- used to specify query constraints

Designing an Individual Dimensional Model

Step 1: Choose the Business Process to Model

Almost always something to do with the "Sales" and/or "Productivity"
- Retail
- Healthcare
- Insurance
- Telecoms
Motivated using $$$$ saved (or gained)

Step 2: Declare the Gain

Answer question: "How do you describe a single row in the fact table"
General guiding principle:
- we choose the lowest meaningful (and feasible) grain of each dimension
Why:
- we want to see all those details
- we can always aggregate (sum, count, average) later

Step 3: Choose the Dimensions

Answer question: "How does the business describe the data that results from the business process?"
This is determined by grain of fact table
add all possibly relevant dimensions and many, many descriptive attribute values

Step 4: Choose the Facts/Measures

Answer question: "What are we measuring?"
Specific to grain of fact table
Store additive values
Percentages and ratios, also store numerator + denominator

Types of Dimensions

Causal: promotion, contract, deal
Degenerate: ticket number, order number
Status: account status, shipment status
Audit: data quality and record lineage
Junk: indicators and flags
role-playing: one dimension acting in many "views" of the data

Role-playing dimensions

keys with different roles showing the same data
you can implement a role-playing dimension using a SQL View

Modelling multiple units of measure

To make sure there is no confusion, we have units of measure in the data schema design. For example if a product has a length field, then we also store a length-unit field.

We could also design our schema to store the length in all units of measure, so a length-cm field or length-inches.

Modelling multiple currencies

We keep a local currency price as well as a field for equivalent currencies in our schema.

When do we measure the conversation rate?

We convert the local currency based on where the data mart is located. We would use the data mart's local time.

Modelling many-to-many Dimensions

0, 1 M: exact number unknown before creating dimensional model
use a bridge entity
useful for easy querying
dimensional model won't be a star anymore
use with caution
implemented during data staging

Monster Dimension

A monster dimension changes frequently, leading to an excessive number of rows.

Use banding as a solution. Think of banding like a range. Leads to better analytics and better data mining, both occur sometimes.

Slowly changing dimension

Let some fields be overwritten if incorrect data is inputted.

Alternative Designs

Stars vs Snowflakes

snowflaking happens when we choose to normalise a dimension.
the golden rule: avoid as far as possible.
why? recall cost of joining tables

Galaxies or Fact Constellations

These are data marts share dimensions.

Implementation issues

conformed dimensions
conformed measures
conformed: exactly the same, or a rollup / drill down
aim: avoid stovepipes
golden rule: when we aggregate (sum or count), we get the same totals

Fact constellations revisited

Galaxies data integration issues

conflicting dimensional attribute meaning
conflicting measure meaning
our aim conformed, exactly the same or a rollup (concept hierarchies)
golden rule:
- when we aggregate, we get the same dimensions

Conformed dimensions

Attributes in separate dimensional tables (associated with different have same names and domains)
Goal: Analytic consistency
Same key values, same data definitions, same attribute values

Productivity

Problem: conflicts during data integration

Dimensions should match exactly between two sources

Approach: find common ground - (intersection)

Granularity

Conflicts during data integration, missing details

Solution: add more detail, refine granularity

Conformed measures

Technical definition should be exactly the same, if they are going to be compared or aggregated

Handling multiple grains

Demographic data is aggregated data used to describe a population

used for customer segmentation
use Outrigging for low cardinality attributes
Why? different grain, different load cycles avoid monster dimensions

Outrigging

Downwards allocation

When you have multiple grains the general idea is to allocate line items using a standard approach

Modeling mistakes

Don'ts:

Place text attributes in the Fact table
Limit verbose descriptions to save space
Avoid snowflaking. Normalize to save space (leads to slower queries)
Ignoring the need to track changes
Add new hardware to solve all query performance issues
Use operational keys as primary keys
Neglect to declare and comply with grain
Neglect a detailed design
Expect users to query normalized data
Fail to conform Facts and Dimensions

Key Concepts and Definitions

Data Staging