Introduction to the Multi-dimensional Model

A data structure optimized for data analysis

Purpose: Enable complex analytical and ad-hoc queries with fast execution times
Key components: dimensions, measures, facts

Dimensions

Descriptive attributes by which facts are analyzed

Examples: time, product, customer, location

Dimensions are tables containing:

Attributes: Descriptive properties of a dimension
- Example: product name, color, size
Hierarchies: Logical structures within a dimension that support different levels of granularity
- Example: Year > Quarter > Month > Day

Dimensions in Data Warehousing

Definition

Dimensions are descriptive attributes used to analyze facts in a data warehouse. They provide the context for numerical measures (facts) and enable various types of analysis.

Key Characteristics

Serve as the foundation for querying and filtering data
Organize data into hierarchies, allowing for drill-down and roll-up operations
Contain textual descriptions that give meaning to the numerical facts

Common Examples

Time: Allows analysis of data over different time periods (e.g., day, month, quarter, year)
Product: Enables analysis of data by product categories, individual products, etc.
Customer: Facilitates analysis based on customer attributes (e.g., demographics, behavior)
Location: Supports geographical analysis (e.g., country, region, city)

Attributes and Hierarchies

Attributes: Descriptive properties of a dimension (e.g., product name, color, size)
Hierarchies: Logical structures within a dimension that support different levels of granularity
- Example: Time hierarchy - Year > Quarter > Month > Day

Types of Dimensions

1. Conformed Dimensions

Definition: Dimensions that are shared across multiple fact tables or data marts
Purpose: Ensure consistency and enable integrated analysis across the entire data warehouse
Example: A customer dimension used in both sales and support fact tables

2. Role-playing Dimensions

Definition: A single dimension used multiple times in a fact table, each time with a different context
Purpose: Reduce redundancy and save storage while maintaining logical distinctions
Example: A date dimension playing roles like order date, ship date, and delivery date in an order fact table
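
A minimal sketch of a role-playing date dimension, using Python's built-in sqlite3 module. The table and column names (date_dim, order_fact, order_date_key, ship_date_key) and the sample rows are hypothetical, chosen only for illustration. The single date table is joined twice under different aliases, once per role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT);
    CREATE TABLE order_fact (
        order_number   TEXT,
        order_date_key INTEGER REFERENCES date_dim(date_key),
        ship_date_key  INTEGER REFERENCES date_dim(date_key),
        sales_amount   REAL
    );
    INSERT INTO date_dim VALUES (1, '2024-01-05'), (2, '2024-01-08');
    INSERT INTO order_fact VALUES ('SO-1', 1, 2, 120.0);
""")

# One physical date table, aliased once per role it plays.
row = conn.execute("""
    SELECT f.order_number, od.full_date AS order_date, sd.full_date AS ship_date
    FROM order_fact AS f
    JOIN date_dim AS od ON od.date_key = f.order_date_key
    JOIN date_dim AS sd ON sd.date_key = f.ship_date_key
""").fetchone()
print(row)
```

Only the aliases (od, sd) distinguish the roles; the date rows themselves are stored once.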

3. Junk Dimensions

Definition: A dimension that combines several low-cardinality attributes (flags or indicators) into a single dimension
Purpose: Simplify the dimensional model by reducing the number of small dimensions
Example: Combining order status, shipping method, and payment type into a single dimension
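
One common way to populate a junk dimension is to generate every combination of the flag values up front. The sketch below assumes three hypothetical attributes with made-up value lists; real value sets would come from the source systems.

```python
from itertools import product

# Hypothetical low-cardinality attributes and their possible values.
order_status = ["open", "closed"]
shipping_method = ["ground", "air"]
payment_type = ["card", "cash", "invoice"]

# One junk-dimension row per combination, each with its own surrogate key.
combos = product(order_status, shipping_method, payment_type)
junk_dim = {key: combo for key, combo in enumerate(combos, start=1)}

# Reverse lookup used when loading facts: combination -> surrogate key.
junk_key = {combo: key for key, combo in junk_dim.items()}

print(len(junk_dim))
print(junk_key[("closed", "air", "cash")])
```

The fact table then carries a single junk key instead of three separate foreign keys.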

4. Degenerate Dimensions

Definition: Dimensional attributes stored in the fact table rather than in a separate dimension table
Purpose: Keep an identifier available for grouping and tracing facts when it has no descriptive attributes of its own, so a separate dimension table would add nothing
Example: Order number or transaction ID stored directly in the fact table

Best Practices

Design dimensions with the end-users' analytical needs in mind
Ensure dimension tables contain rich, descriptive attributes to support various types of analysis
Use meaningful, business-friendly names for dimension attributes
Decide how each slowly changing dimension handles attribute changes over time, and apply updates consistently to maintain data accuracy
Document the structure and meaning of each dimension thoroughly

Measures

Numerical facts to be analyzed

Types of measures:
- Additive: Can be summed across all dimensions
- Semi-additive: Can be summed across some dimensions, but not all
- Non-additive: Cannot be summed meaningfully across any dimension
Derived measures and calculated members

Measures in Data Warehousing

Definition

Measures are numerical facts to be analyzed in a data warehouse. They represent the quantitative data that users want to examine and analyze across various dimensions.

Types of Measures

1. Additive Measures

Definition: Measures that can be summed across all dimensions
Characteristics:
- Can be aggregated meaningfully along any dimension
- Most common type of measure in data warehouses
Examples: Sales amount, quantity sold, revenue

2. Semi-additive Measures

Definition: Measures that can be summed across some dimensions, but not all
Characteristics:
- Often meaningful when aggregated over some dimensions (e.g., products) but not others (e.g., time)
- Require careful consideration when aggregating
Examples: Account balance, inventory levels
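
A small sketch of why balances are semi-additive, using hypothetical month-end snapshots. Summing across accounts gives a sensible total, while summing the same account across months does not.

```python
# Hypothetical month-end balance snapshots: (account, month) -> balance.
balances = {
    ("A", "2024-01"): 100, ("A", "2024-02"): 150,
    ("B", "2024-01"): 200, ("B", "2024-02"): 250,
}

# Summing across accounts for one month is meaningful: total money held.
total_feb = sum(v for (acct, month), v in balances.items() if month == "2024-02")

# Summing one account across months is not; it counts the same money repeatedly.
misleading_sum_a = sum(v for (acct, month), v in balances.items() if acct == "A")

# Typical alternatives over time: the closing (last) balance or the average.
a_by_month = sorted((month, v) for (acct, month), v in balances.items() if acct == "A")
closing_a = a_by_month[-1][1]
average_a = sum(v for _, v in a_by_month) / len(a_by_month)
```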

3. Non-additive Measures

Definition: Measures that cannot be summed meaningfully across any dimension
Characteristics:
- Often ratios, percentages, or averages
- Require special handling in analysis and reporting
Examples: Profit margin percentages, average prices, ratios
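
The sketch below illustrates one form of the "special handling" for a ratio measure, with two hypothetical rows. Averaging the per-row margins gives a different answer from computing the margin on the summed revenue and cost, and only the latter reflects the business as a whole.

```python
# Hypothetical rows: (product, revenue, cost).
rows = [("widget", 100.0, 50.0), ("gadget", 1000.0, 900.0)]

# Per-row margin as a fraction of revenue.
margins = [(rev - cost) / rev for _, rev, cost in rows]

# Averaging the ratios ignores how much revenue each row carries.
naive_margin = sum(margins) / len(margins)

# Aggregate the additive parts first, then compute the ratio.
total_rev = sum(rev for _, rev, _ in rows)
total_cost = sum(cost for _, _, cost in rows)
overall_margin = (total_rev - total_cost) / total_rev
```

Storing the additive components (revenue and cost) in the fact table and deriving the ratio at query time avoids this problem.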

Derived Measures and Calculated Members

Derived Measures:
- Measures calculated from other measures or dimensional attributes
- Typically computed at query time rather than stored in the fact table
- Example: Profit (derived from Revenue - Cost)
Calculated Members:
- Similar to derived measures but defined within the dimensional structure
- Can involve complex calculations and business logic
- Example: Year-to-date totals, moving averages
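
A minimal sketch of both ideas on hypothetical monthly figures: profit as a derived measure, and a year-to-date total as a running sum over the time dimension.

```python
from itertools import accumulate

# Hypothetical monthly facts: (month, revenue, cost).
monthly = [("2024-01", 100, 60), ("2024-02", 120, 70), ("2024-03", 90, 50)]

# Derived measure: profit computed from two stored measures.
profit = [rev - cost for _, rev, cost in monthly]

# Year-to-date total: running sum of profit in month order.
ytd_profit = list(accumulate(profit))
```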

Best Practices

Clearly identify and document the type of each measure (additive, semi-additive, non-additive)
Design fact tables with primarily additive measures for optimal performance
Use appropriate aggregation methods for semi-additive and non-additive measures
Consider pre-calculating complex derived measures for performance reasons
Ensure calculated members are well-documented and understood by end-users
Regularly validate the accuracy of derived measures and calculated members

+----------------------------+     +------------------------+
|         Sales_Fact         |     |   Customer_Dimension   |
+----------------------------+     +------------------------+
| Customer_ID (FK)           |---->| Customer_ID (PK)       |
| Product_ID (FK)            |     | Customer_Name          |
| Time_ID (FK)               |     +------------------------+
| Quantity_Sold (Measure)    |
| Sales_Amount (Measure)     |
+----------------------------+
            |
            v
+----------------------------+
|        Time_Dimension      |
+----------------------------+
| Time_ID (PK)               |
| Year                       |
| Month                      |
+----------------------------+
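
The diagram above can be tried out with Python's built-in sqlite3 module. This is a sketch only: the table and column names follow the diagram, while the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer_Dimension (Customer_ID INTEGER PRIMARY KEY, Customer_Name TEXT);
    CREATE TABLE Time_Dimension (Time_ID INTEGER PRIMARY KEY, Year INTEGER, Month INTEGER);
    CREATE TABLE Sales_Fact (
        Customer_ID   INTEGER REFERENCES Customer_Dimension(Customer_ID),
        Product_ID    INTEGER,  -- product dimension omitted, as in the diagram
        Time_ID       INTEGER REFERENCES Time_Dimension(Time_ID),
        Quantity_Sold INTEGER,
        Sales_Amount  REAL
    );
    INSERT INTO Customer_Dimension VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Time_Dimension VALUES (1, 2024, 1), (2, 2024, 2);
    INSERT INTO Sales_Fact VALUES
        (1, 10, 1, 2, 20.0), (1, 11, 2, 1, 15.0), (2, 10, 2, 3, 30.0);
""")

# Total sales per customer per month: join the fact to its dimensions, then group.
result = conn.execute("""
    SELECT c.Customer_Name, t.Year, t.Month, SUM(f.Sales_Amount)
    FROM Sales_Fact AS f
    JOIN Customer_Dimension AS c ON c.Customer_ID = f.Customer_ID
    JOIN Time_Dimension AS t ON t.Time_ID = f.Time_ID
    GROUP BY c.Customer_Name, t.Year, t.Month
    ORDER BY c.Customer_Name, t.Year, t.Month
""").fetchall()
```

The measures live in the fact table; the dimensions supply the labels used for grouping and filtering.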

Facts

Collection of related data items, consisting of measures and context

Types of fact tables:
- Transaction fact tables
- Periodic snapshot fact tables
- Accumulating snapshot fact tables
Granularity of facts
Relationship between facts and dimensions

Facts in Data Warehousing

Definition

Facts are collections of related data items, consisting of measures and context. They represent the core data to be analyzed in a data warehouse, typically stored in fact tables.

Types of Fact Tables

1. Transaction Fact Tables

Description: Represent individual transactions or events
Characteristics:
- Finest grain of detail
- One row per transaction
- Usually the most voluminous
Example: Individual sales transactions, ATM withdrawals

2. Periodic Snapshot Fact Tables

Description: Capture the state of things at regular, predetermined time intervals
Characteristics:
- Regular time intervals (e.g., daily, weekly, monthly)
- Consistent level of aggregation over time
- Good for analyzing trends over time
Example: Monthly account balances, daily inventory levels

3. Accumulating Snapshot Fact Tables

Description: Track the progress of a process with a definite beginning and end
Characteristics:
- One row per process instance
- Updated as the process progresses
- Contains multiple date columns for different milestones
Example: Order processing (order date, shipment date, delivery date)
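
A rough sketch of how an accumulating snapshot row evolves, using a plain dictionary in place of a table. The helper name record_milestone and the sample order are hypothetical.

```python
# Hypothetical accumulating snapshot: one row per order, keyed by order number.
order_snapshot = {}

def record_milestone(order_number, milestone, date):
    """Create the order's row on first sight, then fill in the milestone date."""
    row = order_snapshot.setdefault(
        order_number,
        {"order_date": None, "shipment_date": None, "delivery_date": None},
    )
    row[milestone] = date

record_milestone("SO-1", "order_date", "2024-01-05")
record_milestone("SO-1", "shipment_date", "2024-01-08")
```

After these two calls there is still a single row for the order; the delivery date stays empty until that milestone is reached.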

Granularity of Facts

Definition: The level of detail represented by each row in a fact table
Importance:
- Determines the types of analyses that can be performed
- Affects the size and performance of the data warehouse
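Best Practice: Choose the finest grain (most detailed level) that is practical and meaningful for the business; detailed data can always be aggregated later, but aggregated data cannot be broken back down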
Example: Individual product sales vs. daily total sales by store
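
The two grains in the example can be compared with a short sketch on hypothetical transactions. Aggregating to the coarser grain is easy, but the product detail is lost in the process.

```python
from collections import defaultdict

# Transaction grain: one row per product sold (hypothetical data).
transactions = [
    ("2024-01-05", "Store 1", "shirt", 20.0),
    ("2024-01-05", "Store 1", "phone", 300.0),
    ("2024-01-05", "Store 2", "shirt", 20.0),
]

# Coarser grain: one row per day and store.
daily_store_sales = defaultdict(float)
for date, store, product, amount in transactions:
    daily_store_sales[(date, store)] += amount

# Product-level questions can still be answered from the fine grain...
shirt_sales = sum(amount for _, _, product, amount in transactions if product == "shirt")
# ...but not from daily_store_sales, which no longer carries the product.
```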

Relationship between Facts and Dimensions

Structure: Facts are typically surrounded by dimensions in a star or snowflake schema
Connections:
- Facts contain foreign keys that link to dimension tables
- These links allow for rich, multi-dimensional analysis
Dimensionality: The number of dimensions a fact table references
Analysis: Dimensions provide the context for analyzing the measures in the fact table

Best Practices

Choose the appropriate fact table type based on business requirements and analysis needs
Determine the optimal granularity that balances detail with performance
Keep a single, consistent grain within each fact table, and document the grain of related fact tables so they can be compared correctly
Design fact tables to be as narrow as possible, including only necessary columns
Use surrogate keys for dimension references to improve performance and handle changing dimension data
Document the meaning and context of each fact table thoroughly

The Cube Concept

Multi-dimensional representation of data

Visualizing multi-dimensional data
Basic operations:
- Slicing
- Dicing
- Pivoting

The Cube Concept in Data Warehousing

Definition

The cube concept refers to a multi-dimensional representation of data in a data warehouse. It allows for the visualization and analysis of data across multiple dimensions simultaneously.

Visualizing Multi-dimensional Data

Three-dimensional cube: Often used to represent data with three dimensions (e.g., Product, Time, Location)
Hypercube: Represents data with more than three dimensions
Cells: Intersection points in the cube, containing measure values
Edges: Represent dimensions (e.g., time, product, location)

Basic Operations

1. Slicing

Definition: Extracting a slice of the data cube by fixing one dimension to a single value
Example: Analyzing sales for a specific month across all products and locations
Benefit: Allows for focused analysis on a particular aspect of the data

2. Dicing

Definition: Extracting a sub-cube by selecting specific values or ranges on two or more dimensions
Example: Analyzing sales for a specific product category in a particular region for the last quarter
Benefit: Enables more granular analysis by focusing on multiple specific aspects simultaneously

3. Pivoting

Definition: Rotating the cube to view data from different perspectives
Also known as: Rotation
Example: Changing the view from "Product by Region" to "Region by Product"
Benefit: Provides different analytical perspectives on the same dataset
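
The three operations can be sketched on a tiny cube held as a Python dictionary. The cube contents and the dimension order (product, region, month) are assumptions made for this illustration; OLAP tools implement the same ideas on much larger structures.

```python
# Hypothetical cube: (product, region, month) -> sales amount.
cube = {
    ("shirt", "North", "Jan"): 10, ("shirt", "South", "Jan"): 20,
    ("phone", "North", "Jan"): 30, ("phone", "North", "Feb"): 40,
}

# Slice: fix one dimension (month = "Jan"); the result has two dimensions left.
jan_slice = {(p, r): v for (p, r, m), v in cube.items() if m == "Jan"}

# Dice: restrict two dimensions to chosen values; the result is a smaller cube.
sub_cube = {
    (p, r, m): v for (p, r, m), v in cube.items()
    if p == "phone" and r == "North"
}

# Pivot: swap the axes of the slice from product-by-region to region-by-product.
pivoted = {(r, p): v for (p, r), v in jan_slice.items()}
```

Pivoting changes only how the cells are addressed; the values themselves are unchanged.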

Additional Important Concepts

Drill-down and Roll-up

Drill-down: Moving from a higher level of aggregation to a more detailed level
Roll-up: Aggregating data to a higher level in a dimension hierarchy
Example: Drilling down from yearly sales to monthly sales, or rolling up from city-level data to country-level data
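
A short sketch of roll-up and drill-down along a location hierarchy. The city-to-country mapping and the sales figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> country.
city_to_country = {"Lyon": "France", "Paris": "France", "Rome": "Italy"}

# City-level sales (the more detailed level).
city_sales = {"Lyon": 10, "Paris": 30, "Rome": 25}

# Roll-up: aggregate each city's value into its parent country.
country_sales = defaultdict(int)
for city, amount in city_sales.items():
    country_sales[city_to_country[city]] += amount

# Drill-down: return to the detail rows behind one country's total.
france_detail = {c: v for c, v in city_sales.items() if city_to_country[c] == "France"}
```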

Aggregation

Process of calculating summary values across dimensions
Types include sum, average, count, min, max, etc.
Essential for providing different levels of data granularity

Benefits of the Cube Concept

Enables intuitive representation of complex, multi-dimensional data
Facilitates quick and flexible data analysis
Supports various levels of data aggregation and detail
Allows for easy identification of trends, patterns, and anomalies
Enhances decision-making by providing multi-faceted views of business data

Challenges and Considerations

Data sparsity: Many cells in a cube may be empty, leading to storage inefficiencies
Performance: Large cubes with many dimensions can be computationally intensive
Design complexity: Deciding on the right dimensions and hierarchies requires careful planning
Data updates: Updating pre-aggregated data in cubes can be complex and time-consuming
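
To get a feel for data sparsity, the arithmetic below compares the number of cells a dense cube would reserve with the number that actually hold data. All sizes, including the count of populated cells, are assumed figures for illustration.

```python
# Hypothetical dimension sizes.
n_products, n_stores, n_days = 1_000, 50, 365

# A dense cube reserves a cell for every combination.
dense_cells = n_products * n_stores * n_days

# Suppose only 200,000 combinations actually had a sale (an assumed figure).
populated_cells = 200_000

# Fraction of cells that would be empty.
sparsity = 1 - populated_cells / dense_cells
print(dense_cells, round(sparsity, 4))
```

Under these assumptions nearly all cells are empty, which is why sparse storage matters for large cubes.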

Benefits of the Multi-dimensional Model

Intuitive data representation
Efficient query performance
Flexibility in analysis

Multi-dimensional vs. Relational Model

Aspect	Relational Model	Multi-dimensional Model
Primary Purpose	Operational processing (OLTP)	Analytical processing (OLAP)
Data Structure	Normalized tables	Denormalized, star or snowflake schema
Optimization	For data insertion and updates	For complex queries and aggregations
Query Complexity	Simple, predefined queries	Complex, ad-hoc queries
Data Redundancy	Minimized	Accepted for performance
Time Dimension	Usually represents current state	Historical data is a key aspect
Data Volume	Typically smaller	Usually much larger

In-Class Exercise

A retail store chain tracks its sales data to analyze business performance. The store collects the following information for each sale:

Product Information:
Product name, Product category (e.g., electronics, clothing), Price
Customer Information:
Customer ID, Customer age group (e.g., 18-25, 26-35), Gender
Store Information:
Store location (city), Store region (e.g., North, South)
Sales Information:
Sale date, Quantity sold, Total sales amount (Quantity sold * Price)

Task:

Identify the Dimensions (descriptive attributes)
Identify the Measures (quantitative values)
Define the Fact Table (the main business event and associated facts)

Data Staging Area vs. Integration Layer

Aspect	Data Staging Area	Integration Layer
Purpose	Temporary workspace for extracting and transforming raw data	Consolidates and integrates data for unified, consistent datasets
Functions	Extraction, cleansing, transformation, loading	Data harmonization, applying business logic, consolidation
Nature	Temporary data storage	Permanent, integrated data storage
Processing Stage	Early stage of ETL process (before transformation is complete)	Post-transformation, ready for querying or further loading
Persistence	Short-term; data is usually discarded after loading	Long-term; integrated data is stored for querying

Data warehouse I

Week 2