Data Staging
Planning
Goal of data staging: get the right data from the sources into a data mart.
Best way to do data staging?
- Round up the requirements - conceptual model
- Consider the business needs
- Study the sources
- Look out for data limitations
- Decide on scripting languages
- Look at the staff skills
- Remember legacy licenses
The data staging steps
A: Planning
- High level Planning
- Detailed planning: dimension management, error handling, fact table construction, etc.
B: Develop one-time historic load
C: Develop incremental load
Step A1: High-level Planning
Create a very high-level, one-page schematic of the source-to-target flow
- Identify starting and ending points
- Label known data sources
- Include placeholders for sources yet to be determined
- Label targets
- Add notes about known problems
Step A2: Detailed planning by table
Drill down by target table, graphically sketching any complex data restructuring or transformations
Identify attribute hierarchies (normalize the source)
- A hierarchy is a relationship between attributes where the data rolls up into higher levels of summarization in a series of strict many-to-one relationships. Hierarchies are reflected by additional columns in a dimensional table.
- The inverse of a roll up is a drill down
- Data staging issue: make sure hierarchies are clean, i.e., strictly many-to-one (e.g., products in a store); a validation sketch follows this list
Graphically illustrate the surrogate-key generation process
Develop a preliminary job sequencing
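A minimal sketch of the hierarchy check mentioned above, assuming the source rows have already been extracted as (product, category) pairs; the attribute names are illustrative.

```python
from collections import defaultdict

def find_hierarchy_violations(rows):
    """Return products that roll up to more than one category,
    i.e., violations of the strict many-to-one hierarchy."""
    categories = defaultdict(set)
    for product, category in rows:   # illustrative attribute pair
        categories[product].add(category)
    return {p: cats for p, cats in categories.items() if len(cats) > 1}

# Example: the second product maps to two categories and is flagged.
rows = [("P1", "Dairy"), ("P2", "Dairy"), ("P2", "Frozen")]
print(find_hierarchy_violations(rows))   # {'P2': {'Dairy', 'Frozen'}}
```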
Developing One-Time Historic Load
- Build and test dimensions load
- Build and test fact table load (develop surrogate key pipeline)
Step B1: Populate Dimension tables
- Static, offline dimension extract
- Creating and moving the result set
- Data compression
- Data encryption
- Transformations
- Simple data transformations, change of data types
- Handling NULLs (decide how they are represented)
- Consolidation and deduplication
- Surrogate key assignment (see the sketch after this list)
- Use integer “autonumbers”, increasing by 1
- Maintain a lookup table with the production_key-to-surrogate_key matches in the data staging area
- Validating one-to-one and one-to-many relationships
- Constraint: Only one model per product
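A minimal sketch of the surrogate key assignment described above, assuming dimension rows arrive as dicts keyed by their production (natural) key; column names such as product_id are assumptions.

```python
def assign_surrogate_keys(dimension_rows, key_lookup, next_key=1):
    """Assign integer surrogate keys, increasing by 1, and maintain the
    production_key -> surrogate_key lookup table kept in the staging area."""
    keyed_rows = []
    for row in dimension_rows:
        production_key = row["product_id"]      # illustrative natural key
        if production_key not in key_lookup:
            key_lookup[production_key] = next_key
            next_key += 1
        keyed_rows.append({**row, "product_key": key_lookup[production_key]})
    return keyed_rows, next_key

lookup = {}   # in practice this lookup table is persisted between loads
rows, next_key = assign_surrogate_keys(
    [{"product_id": "A-100", "model": "X"}, {"product_id": "A-200", "model": "Y"}],
    lookup)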
Step B2: Populate dimensions - database considerations
- Load
- Turn off logging
- Pre-sort the file
- Transform with caution
- Use the bulk loader
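A hedged sketch of "use the bulk loader", assuming a PostgreSQL target reached through psycopg2 and a pre-sorted, already-transformed CSV file; the connection string, table name, and file name are assumptions.

```python
import psycopg2

# Assumes dim_product.csv is already transformed and pre-sorted on the
# surrogate key, so the bulk loader can append it efficiently.
conn = psycopg2.connect("dbname=dwh user=etl")
with conn, conn.cursor() as cur, open("dim_product.csv") as f:
    cur.copy_expert(
        "COPY dim_product FROM STDIN WITH (FORMAT csv, HEADER true)", f)
```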
Changes to historic data
For simple scenarios where only one or two columns (attributes of a dimension) change, we can simply choose to overwrite them.
Or we can keep track of history by adding a new row to the table with its own new surrogate (primary) key.
Step B3: Populating the historic fact table
- Decide how to handle NULL values in your measures, if any
- Remember the important “not applicable” (NULL) row in relevant dimensions
- E.g., no Promotion during a Sale
- Make absolutely sure your referential integrity is solid (see the fact-load sketch below)
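A minimal sketch of a historic fact load honoring the points above: production keys are swapped for surrogate keys, a missing promotion maps to the special "not applicable" row, and rows that would break referential integrity are rejected. Table layout and column names are assumptions.

```python
NO_PROMOTION_KEY = 0   # the special "not applicable" row in the promotion dimension

def stage_fact_rows(source_rows, product_lookup, promotion_lookup):
    """Replace production keys with surrogate keys; map a missing promotion
    to the special row and reject rows that break referential integrity."""
    staged, rejects = [], []
    for row in source_rows:
        product_key = product_lookup.get(row["product_id"])
        if product_key is None:                    # no matching dimension row
            rejects.append(row)
            continue
        promo = row.get("promotion_id")
        promotion_key = NO_PROMOTION_KEY if promo is None else promotion_lookup.get(promo)
        if promotion_key is None:                  # unknown promotion key
            rejects.append(row)
            continue
        staged.append({"product_key": product_key,
                       "promotion_key": promotion_key,
                       "sales_amount": row["sales_amount"]})
    return staged, rejects
```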
Develop Incremental Load
Step 1: Incremental dimension records
Identify the changed dimension rows, compare the differences, and apply the policy decisions.
Step 2: Incremental slowly changing dimensions
Handling change for slowly changing dimensions (a sketch follows this list):
- Type 1: overwrite
- Type 2A: add new row
- Type 2B: add new row & flag
- Type 2C: add new row & dates
- Type 3: Add new attributes
- Type 4: Add mini dimensions
- Type 5 to 7: Hybrid and complex to code
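A minimal sketch of Type 1 versus Type 2 handling, assuming the dimension is held in memory keyed by surrogate key with row_start/row_end effective dates (the Type 2C variant); all names are illustrative.

```python
from datetime import date

def apply_type1(dimension, surrogate_key, changes):
    """Type 1: overwrite the attributes in place; no history is kept."""
    dimension[surrogate_key].update(changes)

def apply_type2(dimension, natural_key, changes, next_key):
    """Type 2 with dates: expire the current row and add a new row
    under a new surrogate key with fresh effective dates."""
    today = date.today()
    current = next(row for row in dimension.values()
                   if row["product_id"] == natural_key and row["row_end"] is None)
    current["row_end"] = today
    dimension[next_key] = {**current, **changes,
                           "row_start": today, "row_end": None}
    return next_key + 1

# Usage: product A-100 changes model; its history is kept as a second row.
dim = {1: {"product_id": "A-100", "model": "X",
           "row_start": date(2020, 1, 1), "row_end": None}}
next_key = apply_type2(dim, "A-100", {"model": "X2"}, next_key=2)
```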
Step 3: Incremental fact table staging
Add new transactions at the grain of the fact table (the smallest unit of measurement).
For late arriving measures/facts, identify the correct surrogate keys to link to.
For performance:
- Speeding up the load cycle
- More frequent loading
- Partitioned files and indexes
- Parallel processing
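A minimal sketch of the partitioned/parallel-processing ideas above, assuming the incremental extract arrives as one file per partition; the file-name pattern and the transform/load step are placeholders.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def load_partition(path: Path) -> int:
    """Transform and load a single partition file; returns rows loaded."""
    rows = path.read_text().splitlines()
    # ... transform and bulk-load `rows` into the matching fact partition ...
    return len(rows)

def parallel_incremental_load(extract_dir: str) -> int:
    """Process each partition file in parallel, one worker per partition."""
    partitions = sorted(Path(extract_dir).glob("facts_*.csv"))
    with ProcessPoolExecutor() as pool:
        return sum(pool.map(load_partition, partitions))
```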
Step 4: Aggregate Table and OLAP Loads
- Build aggregates (a roll-up sketch follows this list)
- Maintain aggregates
- Prepare OLAP loads (if any)
- Cube-like structure based on dimensional model
- MOLAP engines build own optimized aggregates
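A minimal sketch of building an aggregate table, rolling daily facts up to a month-by-product summary; the fact layout and column names are assumptions.

```python
from collections import defaultdict

def build_monthly_aggregate(daily_facts):
    """Roll daily fact rows up to a (month, product_key) aggregate table."""
    totals = defaultdict(float)
    for fact in daily_facts:
        month = fact["date"][:7]                 # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(month, fact["product_key"])] += fact["sales_amount"]
    return [{"month": m, "product_key": p, "sales_amount": s}
            for (m, p), s in totals.items()]
```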
Last step: Automation
- Typical operational functions
- Job definition: flow and dependency (see the sketch after this list)
- Job scheduling: time- and event-based
- Monitoring and Notification
- Exception and Error handling
- Determine job control approach
- Record extract metadata
- Record operations metadata
- Ensure data quality
- Set up archiving in the data staging area
- Develop data management procedures
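A minimal sketch of job definition and dependency-driven sequencing with basic error handling, using Python's standard graphlib; the job names are illustrative and run() stands in for the real jobs.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Job flow and dependencies: each job lists the jobs it depends on.
jobs = {
    "extract_sources":   set(),
    "load_dimensions":   {"extract_sources"},
    "load_facts":        {"load_dimensions"},
    "build_aggregates":  {"load_facts"},
    "notify_operations": {"build_aggregates"},
}

def run(job_name):
    print(f"running {job_name}")           # placeholder for the real job

for job in TopologicalSorter(jobs).static_order():
    try:
        run(job)
    except Exception as exc:                # exception and error handling hook
        print(f"{job} failed: {exc}")
        break
```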
Data Quality
Assessing the quality of data:
- flag normal, abnormal, out of bounds or impossible facts
- recognize random or noise values from context and mask them out
- apply a uniform treatment to NULL values
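A minimal sketch of these checks, assuming facts carry a sales_amount measure and plausible bounds are known; the bounds and column name are assumptions.

```python
def assess_quality(facts, low=0.0, high=10_000.0):
    """Flag out-of-bounds or impossible measures and apply a uniform
    treatment to NULLs (a 'missing' flag rather than silent zeros)."""
    report = []
    for fact in facts:
        value = fact.get("sales_amount")    # illustrative measure name
        if value is None:
            status = "missing"
        elif low <= value <= high:
            status = "normal"
        else:
            status = "out_of_bounds"
        report.append({**fact, "quality_flag": status})
    return report
```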
Steps to transform data
- Data cleaning
- Data integration and transformation
- Data reduction
Design decision: done during data staging or by user applications (or at both ends)
The above steps all depend on the domain, organizational culture, and end-user needs and skills.
Tasks to clean data
- fill in missing values
- fill manually
- use default value
- use mean value
- use mean value of class or grouping
- use most probable value
- identify outliers and smooth out noisy data
- binning (a sketch follows this list)
- sort data and partition into equal-frequency bins
- use bin means, bin medians, or bin boundaries
- regression
- fit the data to a function using linear or multiple linear regression
- clustering
- useful for finding outliers
- should involve human inspection
- correct inconsistent data
- resolve redundancy caused by data integration
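A minimal sketch of two of the cleaning tasks above: filling missing values with the mean, and smoothing noisy data by equal-frequency binning with bin means; the sample values are illustrative.

```python
import statistics

def fill_missing_with_mean(values):
    """Fill missing (None) values with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def smooth_by_bin_means(values, n_bins):
    """Sort, partition into (roughly) equal-frequency bins, and replace
    each value by its bin mean."""
    ordered = sorted(values)
    size = len(ordered) // n_bins or 1
    smoothed = []
    for i in range(0, len(ordered), size):
        bin_values = ordered[i:i + size]
        smoothed.extend([statistics.mean(bin_values)] * len(bin_values))
    return smoothed

# Bins of three values are replaced by their bin means (7, 19, 25).
print(smooth_by_bin_means([4, 8, 9, 15, 21, 21, 24, 25, 26], 3))
```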