Data Staging

Planning

The goal of data staging is to get the right data from the sources into a data mart.

Best way to do data staging?

  1. Round up the requirements - conceptual model
  2. Consider the business needs
  3. Study the sources
  4. Look out for data limitations
  5. Decide on scripting languages
  6. Look at the staff skills
  7. Remember legacy licenses

The data staging steps

A: Planning

  1. High level Planning
  2. Detailed planning: dimension management, error handling, fact table construction, etc.

B: Develop one-time historic load

C: Develop incremental load

Step A1: High-level Planning

Create a very high-level, one-page schematic of the source-to-target flow

  • Identify starting and ending points
  • Label known data sources
  • Include placeholders for sources yet to be determined
  • Label targets
  • Add notes about known problems

Step A2: Detailed planning by table

  • Drill down by target table, graphically sketching any complex data restructuring or transformations

  • Identify attribute hierarchies (normalize the source)

    • A hierarchy is a relationship between attributes where the data rolls up into higher levels of summarization in a series of strict many-to-one relationships. Hierarchies are reflected by additional columns in a dimension table.
    • The inverse of a roll-up is a drill-down
    • Data staging issue: make sure hierarchies are clean (e.g., products in a store); a check is sketched after this list
  • Graphically illustrate the surrogate-key generation process

  • Develop a preliminary job sequencing
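
A minimal sketch of such a hierarchy check, assuming a CSV extract with hypothetical product, brand, and category columns; each step of the roll-up must be strictly many-to-one:

```python
# Minimal sketch: verify that a hierarchy rolls up in strict many-to-one
# relationships (file and column names are hypothetical assumptions).
import csv
from collections import defaultdict

def many_to_one_violations(rows, child_col, parent_col):
    """Return child values that roll up to more than one parent."""
    parents = defaultdict(set)
    for row in rows:
        parents[row[child_col]].add(row[parent_col])
    return {child: p for child, p in parents.items() if len(p) > 1}

with open("product_source.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# A clean hierarchy: product -> brand -> category, each step many-to-one.
for child_col, parent_col in [("product", "brand"), ("brand", "category")]:
    for value, found in many_to_one_violations(rows, child_col, parent_col).items():
        print(f"{child_col} {value!r} rolls up to multiple {parent_col}s: {found}")
```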

Developing One-Time Historic Load

  1. Build and test dimensions load
  2. Build and test fact table load (develop surrogate key pipeline)

Step B1: Populate Dimension tables

  • Static, offline dimension extract
  • Creating and moving the result set
    • Data compression
    • Data encryption
  • Transformations
    • Simple data transformations, change of data types
    • Handling NULLs (decide how)
    • Consolidation and deduplication
  • Surrogate key assignment
    • Use integer “autonumbers”, increasing by 1
    • Maintain a lookup table with the production_key-to-surrogate_key matches in the data staging area (see the sketch after this list)
  • Validating one-to-one and one-to-many relationships
    • Constraint: Only one model per product
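
A minimal sketch of the surrogate-key assignment above, with hypothetical names and a plain dict standing in for the staging-area lookup table:

```python
# Minimal sketch of surrogate-key assignment (hypothetical names); a plain
# dict stands in for the lookup table kept in the data staging area.
from itertools import count

surrogate_lookup = {}        # production_key -> surrogate_key
next_key = count(start=1)    # integer "autonumber", increasing by 1

def assign_surrogate_key(production_key):
    """Reuse the existing surrogate key, or mint the next integer."""
    if production_key not in surrogate_lookup:
        surrogate_lookup[production_key] = next(next_key)
    return surrogate_lookup[production_key]

for production_key in ["P-100", "P-200", "P-100"]:
    print(production_key, "->", assign_surrogate_key(production_key))
# P-100 -> 1, P-200 -> 2, and P-100 maps back to 1: no duplicate keys
```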

Step B1 (continued): Populate dimensions - database considerations

  • Load
    • Turn off logging
    • Pre-sort the file
    • Transform with caution
    • Use the bulk loader (sketched after this list)
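
A minimal sketch of a bulk load, assuming PostgreSQL and the psycopg2 driver; the connection string, table, and file names are hypothetical, and the file has been pre-sorted before loading:

```python
# Minimal sketch of a bulk load, assuming PostgreSQL and psycopg2
# (connection string, table, and file names are hypothetical).
# The file is pre-sorted before loading, and COPY avoids per-row INSERTs.
import psycopg2

conn = psycopg2.connect("dbname=warehouse")
with conn, conn.cursor() as cur:
    with open("product_dim_sorted.csv") as f:
        cur.copy_expert(
            "COPY product_dim FROM STDIN WITH (FORMAT csv, HEADER true)", f
        )
conn.close()
```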

Changes to historic data

For simple scenarios where only one or two columns (attributes of a dimension) change, we can simply choose to overwrite them.

Or we can keep track of history by adding a new row to the table and assigning it a new surrogate (primary) key.

Step B2: Populating the historic fact table

  • Decide how to handle NULL values in your measures, if any
  • Remember the important “not applicable” row
    • in relevant dimensions
      • E.g., a “No Promotion” row for sales made without a promotion
  • Make absolutely sure your referential integrity is solid (see the sketch after this list)
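
A minimal sketch of that surrogate-key pipeline, with hypothetical names; a NULL promotion in the source maps to the special “No Promotion” dimension row, and an unknown key is caught as a referential-integrity violation:

```python
# Minimal sketch of the fact load's surrogate-key pipeline (hypothetical
# names). A NULL promotion maps to the special "No Promotion" dimension
# row instead of a NULL foreign key; an unknown production key is a
# referential-integrity violation and stops the row.
NO_PROMOTION_KEY = 0                                   # the "No Promotion" row
promotion_lookup = {"SUMMER-SALE": 1, "CLEARANCE": 2}  # production -> surrogate

def resolve_promotion(production_key):
    if production_key is None:        # sale made without a promotion
        return NO_PROMOTION_KEY
    try:
        return promotion_lookup[production_key]
    except KeyError:
        raise ValueError(f"no dimension row for promotion {production_key!r}")

for product, promotion, amount in [("P-100", "SUMMER-SALE", 9.99),
                                   ("P-200", None, 4.50)]:
    print(product, resolve_promotion(promotion), amount)
```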

Develop Incremental Load

Step C1: Identify changed dimension records

Identify the changed dimension rows, compare the differences, and apply the policy decisions.
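
One common way to spot changed rows is to compare a checksum of each extracted row against the checksum stored at the previous load; a minimal sketch with hypothetical names:

```python
# Minimal sketch: detect changed dimension rows by comparing a checksum
# of the extracted attributes against the checksum stored at the previous
# load (hypothetical names; stored_hashes would live in the staging area).
import hashlib

ATTRIBUTE_COLS = ["name", "brand", "category"]

def row_hash(row):
    text = "|".join(str(row[c]) for c in ATTRIBUTE_COLS)
    return hashlib.md5(text.encode()).hexdigest()

previous = {"production_key": "P-100", "name": "Widget",
            "brand": "Acme", "category": "Tools"}
stored_hashes = {"P-100": row_hash(previous)}   # from the last load

def classify(row):
    key = row["production_key"]
    if key not in stored_hashes:
        return "new"
    if row_hash(row) != stored_hashes[key]:
        return "changed"      # apply the SCD policy decision
    return "unchanged"

print(classify({"production_key": "P-100", "name": "Widget v2",
                "brand": "Acme", "category": "Tools"}))   # -> changed
```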

Step C2: Incremental slowly changing dimensions

Handling change for slowly changing dimensions (a sketch contrasting Type 1 and Type 2 follows the list):

  • Type 1: overwrite
  • Type 2A: add new row
  • Type 2B: add new row & flag
  • Type 2C: add new row & dates
  • Type 3: Add new attributes
  • Type 4: Add mini dimensions
  • Type 5 to 7: Hybrid and complex to code
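
A minimal sketch contrasting Type 1 (overwrite) with Type 2C (add a new row with effective dates); all names are hypothetical, and a list of dicts stands in for the dimension table:

```python
# Minimal sketch contrasting SCD Type 1 (overwrite) with Type 2C (add a
# new row with effective dates); hypothetical names and data.
from datetime import date

dimension = [
    {"surrogate_key": 1, "production_key": "P-100", "city": "Oslo",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def apply_type1(production_key, attribute, value):
    """Type 1: overwrite in place; history is lost."""
    for row in dimension:
        if row["production_key"] == production_key and row["valid_to"] is None:
            row[attribute] = value

def apply_type2(production_key, attribute, value, change_date, new_key):
    """Type 2C: close the current row, add a new row with a new key."""
    for row in dimension:
        if row["production_key"] == production_key and row["valid_to"] is None:
            row["valid_to"] = change_date
            new_row = dict(row, surrogate_key=new_key,
                           valid_from=change_date, valid_to=None)
            new_row[attribute] = value
            dimension.append(new_row)
            return

apply_type2("P-100", "city", "Bergen", date(2024, 6, 1), new_key=2)
for row in dimension:
    print(row)   # old row closed on 2024-06-01, new row is current
```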

Step C3: Incremental fact table staging

Append the new transactions; the incremental load works like the historic load, just on much smaller volumes of data.

For late arriving measures/facts, identify the correct surrogate keys to link to.
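
A minimal sketch of that lookup, assuming Type 2 rows carry effective-date ranges (names and data hypothetical): pick the surrogate key that was valid on the transaction date, not the current one.

```python
# Minimal sketch: for a late-arriving fact, choose the surrogate key that
# was valid on the transaction date, using Type 2 effective-date ranges
# (hypothetical names and data).
from datetime import date

customer_dim = [
    {"surrogate_key": 1, "production_key": "C-7",
     "valid_from": date(2020, 1, 1), "valid_to": date(2023, 5, 1)},
    {"surrogate_key": 2, "production_key": "C-7",
     "valid_from": date(2023, 5, 1), "valid_to": None},
]

def key_at(production_key, txn_date):
    for row in customer_dim:
        if (row["production_key"] == production_key
                and row["valid_from"] <= txn_date
                and (row["valid_to"] is None or txn_date < row["valid_to"])):
            return row["surrogate_key"]
    raise LookupError(f"no {production_key!r} row valid on {txn_date}")

print(key_at("C-7", date(2022, 3, 14)))   # -> 1, not the current key 2
```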

For performance:

  • Speeding up the load cycle
  • More frequent loading
  • Partitioned files and indexes
  • Parallel processing

Step C4: Aggregate Table and OLAP Loads

  • Build aggregates (see the sketch after this list)
  • Maintain aggregates
  • Prepare OLAP loads (if any)
    • Cube-like structure based on dimensional model
    • MOLAP engines build own optimized aggregates
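
A minimal sketch of building a monthly-by-category aggregate from atomic fact rows (hypothetical data); a real load would typically do this in SQL or leave it to the MOLAP engine:

```python
# Minimal sketch: roll atomic fact rows up to a month-by-category
# aggregate table (hypothetical data).
from collections import defaultdict

fact_rows = [
    ("2024-06", "Tools", 9.99),
    ("2024-06", "Tools", 4.50),
    ("2024-07", "Garden", 12.00),
]

aggregate = defaultdict(float)   # (month, category) -> total sales
for month, category, amount in fact_rows:
    aggregate[(month, category)] += amount

for (month, category), total in sorted(aggregate.items()):
    print(month, category, round(total, 2))
```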

Last step: Automation

  • Typical operational functions
    • Job definition: flow and dependency (sketched after this list)
    • Job scheduling: time and event based
    • Monitoring and Notification
    • Exception and Error handling
  • Determine job control approach
  • Record extract metadata
  • Record operations metadata
  • Ensure data quality
  • Set up archiving in the data staging area
  • Develop data management procedures
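
A minimal sketch of dependency-ordered job execution with basic error handling; the job names are hypothetical, and graphlib is in the Python standard library from 3.9:

```python
# Minimal sketch: run staging jobs in dependency order with basic error
# handling (hypothetical job names; graphlib is stdlib since Python 3.9).
from graphlib import TopologicalSorter

jobs = {                                   # job -> jobs it depends on
    "load_dimensions": set(),
    "load_facts": {"load_dimensions"},
    "build_aggregates": {"load_facts"},
}

def run(job):
    print(f"running {job}")                # placeholder for the real work

for job in TopologicalSorter(jobs).static_order():
    try:
        run(job)
    except Exception as exc:               # notify and stop downstream jobs
        print(f"{job} failed: {exc}")
        break
```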

Data Quality

Assessing the quality of data (a sketch follows the list):

  • flag normal, abnormal, out-of-bounds, or impossible facts
  • recognize random or noise values from context and mask them out
  • apply a uniform treatment to NULL values
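
A minimal sketch of such flagging, with hypothetical bounds and a uniform treatment of NULLs:

```python
# Minimal sketch: flag facts as normal, out of bounds, or NULL, with a
# uniform treatment of NULLs (the bounds here are hypothetical).
def assess(amount, low=0.0, high=10_000.0):
    if amount is None:
        return "null"               # uniform NULL treatment
    if not (low <= amount <= high):
        return "out_of_bounds"      # abnormal or impossible value
    return "normal"

for amount in [9.99, None, -5.0, 1_000_000.0]:
    print(amount, assess(amount))
```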

Steps to transform data

  1. Data cleaning
  2. Data integration and transformation
  3. Data reduction

Design decision: is this done during data staging or by user applications (or at both ends)?

The above steps all depend on the domain, the organizational culture, and end-user needs and skills.

Tasks to clean data

  • fill in missing values
    • fill manually
    • use default value
    • use mean value
    • use mean value of class or grouping
    • use most probable value
  • identify outliers and smooth out noisy data
    • binning (sketched after this list)
      • sort the data and partition it into equal-frequency bins
      • smooth using bin means, bin medians, or bin boundaries
    • regression
      • fit the data to a function using linear or multiple linear regression
    • clustering
      • useful for finding outliers
    • should involve human inspection
  • correct inconsistent data
  • resolve redundancy caused by data integration
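
A minimal sketch of smoothing by equal-frequency binning with bin means, on hypothetical toy data:

```python
# Minimal sketch of smoothing by equal-frequency binning: sort, partition
# into bins of (roughly) equal size, then replace each value with its
# bin mean (hypothetical toy data).
def smooth_by_bin_means(values, n_bins):
    ordered = sorted(values)
    size = len(ordered) // n_bins
    smoothed = []
    for i in range(n_bins):
        end = (i + 1) * size if i < n_bins - 1 else len(ordered)
        bin_ = ordered[i * size:end]          # last bin takes the remainder
        mean = sum(bin_) / len(bin_)
        smoothed.extend([mean] * len(bin_))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```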