Paris-Saclay University

Data warehouse I

Week 1

Introduction

Course Overview

Course Objectives
  • Understand the fundamentals of data warehousing
  • Learn about data warehouse architecture and design
  • Explore ETL processes and data modeling
  • Gain hands-on experience with open-source DW tools

Course Overview

Course Structure
  • 7 weeks of lectures and labs
  • Weekly quizzes and in-class exercises
  • Final exam in Week 8

Importance of Data Warehousing

  • Enabling informed decision-making
  • Consolidating data from multiple sources
  • Providing historical perspective on business performance
  • Supporting complex queries and analytics
  • Improving data quality and consistency

Challenges for Databases

The Data Explosion

  • Volume: Exponential growth in data generation
  • Variety: Structured, semi-structured, and unstructured data
  • Velocity: Speed of data generation and processing

Need for Real-time Analytics

  • Businesses require up-to-the-minute insights

Data Integration Challenges

  • Multiple data sources with different formats
  • Data silos in organizations
  • Ensuring data consistency across sources


Data Quality and Consistency

  • Common data quality issues:
    • Duplicates, missing values, inconsistent formats
  • Impact of poor data quality:
    • Incorrect analysis and decision-making
    • Loss of customer trust
  • Importance of data cleansing and validation

Scalability and Performance

  • Handling growing data volumes
  • Maintaining query performance as data size increases
  • Balancing read and write operations


Security and Privacy Concerns

  • Protecting sensitive data (e.g., personal information, financial data)
  • Compliance with regulations (e.g., GDPR, CCPA)
  • Balancing data accessibility with security
  • Challenges of data anonymization

Brief History of Databases

1960s - Early Database Systems

  • Hierarchical Databases
    • Tree-like structure
    • Limitations: Inflexibility, data redundancy
  • Network Databases
    • Based on the CODASYL model
    • Improvement over hierarchical, but still complex

1970s - Relational Databases and SQL

  • Introduction of the relational model by E.F. Codd (1970)
  • Key concepts: Tables, rows, columns, keys
  • Development of SQL (Structured Query Language)
  • First commercial RDBMS: Oracle (1979)

1980s - Object-Oriented Databases

  • Designed to handle complex data structures
  • Integration with object-oriented programming languages
  • Examples: Versant, ObjectStore
  • Limited adoption compared to relational databases

1990s - Rise of Data Warehousing and OLAP

  • Inmon and Kimball's data warehouse methodologies
  • Introduction of star schema and snowflake schema
  • Development of OLAP (Online Analytical Processing) tools
  • Separation of transactional and analytical systems

2000s - NoSQL and Big Data

  • NoSQL databases emerge to handle web-scale data
    • Types: Document, Key-value, Column-family, Graph
  • Examples: MongoDB, Cassandra, Neo4j
  • Big Data technologies: Hadoop, MapReduce
  • Emphasis on scalability and flexibility

2010s - Cloud Databases and NewSQL

  • Cloud-based database services (DBaaS)
    • Examples: Amazon RDS, Google Cloud SQL, Azure SQL Database
  • NewSQL: Combining ACID properties with NoSQL scalability
    • Examples: Google Spanner, CockroachDB
  • Increased focus on distributed systems and global scale

Current Trends

  • AI/ML integration in databases
    • Automated tuning, predictive analytics
  • Graph databases for complex relationship analysis
  • Multi-model databases
  • Blockchain in database management

Founding Principles of DBMS

Data Independence

  • Physical Data Independence
    • Changes in storage structures don't affect application programs
  • Logical Data Independence
    • Changes in logical schema don't affect application programs
  • Benefits: Flexibility, maintainability, scalability

ACID Properties

  • Atomicity: All-or-nothing transaction execution
  • Consistency: Database remains in a valid state after transaction
  • Isolation: Concurrent transactions don't interfere with each other
  • Durability: Committed transactions are permanent

Atomicity

  +---------------------+            +---------------------+
  |    Account A        |            |    Account B        |
  |  Balance: $1000     |            |  Balance: $500      |
  |---------------------|            |---------------------|
  |  Debit: $100 (-)    |----------> |  Credit: $100 (+)   |
  +---------------------+            +---------------------+
            |                                      |
            |                                      |
            V                                      V
      Atomicity:                            Atomicity:
  Both actions succeed                Both actions succeed
  or none do.                         or none do.
          

Consistency

    Total balance remains the same: $1000 + $500 = $1500
    Consistency: Total amount conserved.

Isolation

[Transaction 1]                       [Transaction 2]
+--------------------+                +---------------------+
| Transfer: A to B   |                | Transfer: C to D    |
|--------------------|                |---------------------|
| Isolated from      |                | Isolated from       |
| Transaction 2      |                | Transaction 1       |
+--------------------+                +---------------------+
    Isolation: Transactions don't interfere with each other
          

Durability

                            CRASH
                            -----
      Durability: Balances are saved, even after failure

  After system restarts:
  +---------------------+            +---------------------+
  |    Account A        |            |    Account B        |
  |  Balance: $900      |            |  Balance: $600      |
  +---------------------+            +---------------------+
          

Concurrency Control

  • Managing simultaneous access to data
  • Techniques:
    • Locking (Shared locks, Exclusive locks)
    • Multiversion Concurrency Control (MVCC)
  • Dealing with deadlocks
  • Ensuring data consistency in multi-user environments

Data Integrity and Constraints

  • Entity Integrity: Primary Key constraints
  • Entity Integrity: Foreign Key constraints
  • Domain Integrity: Data type, format constraints
  • User-Defined Integrity: Custom business rules
  • Importance in maintaining data quality

Query Optimization

  • Process of selecting the most efficient query execution plan
  • Components of query optimization:
    • Query rewriting
    • Statistics and cost estimation
    • Join order selection
  • Impact on database performance

Transaction Management

  • Definition of database transactions
  • Transaction states: Active, Partially Committed, Failed, Aborted, Committed
  • Transaction scheduling
  • Handling transaction failures and system crashes

Recovery Mechanisms

  • Ensuring data persistence and consistency after failures
  • Recovery techniques:
    • Write-Ahead Logging (WAL)
    • Checkpointing
    • Rollback and Rollforward operations
  • Balancing performance and reliability

In-class Exercise: DBMS Principles

Exercise Instructions

  • Form groups of 3-4 students
  • Each group will be assigned a real-world scenario
  • Identify which DBMS principles are most relevant to the scenario
  • Discuss how these principles address the challenges in the scenario
  • Prepare a brief presentation of your findings

Scenario Examples

  1. Online banking system handling thousands of transactions per second
  2. E-commerce platform updating inventory across multiple warehouses
  3. Healthcare system managing patient records with strict privacy requirements
  4. Social media platform supporting millions of concurrent users
link

Data Warehouse: Motivations

Limitations of Operational Databases for Analytics

  • Designed for day-to-day transactions, not complex queries
  • Performance impact of analytical queries on operational systems
  • Lack of historical data retention
  • Data scattered across multiple systems
Analyst: "How many sales completed in dec. before Christmas per group of product and discount?"

              SELECT Y.year, PG.name, DI.disc, count(*)
              FROM year Y, month M, day D, session S,
                line_item I, order O, product P, productgroup PG, 
                discount DI, order_status OS
              WHERE M.year_id = Y.id and
                D.month_id = M.id and
                S.day_id = D.id and
                O.session_id = S.id and
                I.order_id = O.id and
                I.product_id = P.id and
                P.productgroup_id = PG.id and
                DI.productgroup_id = PG.id and
                O.id = OS.order_id and
                D.day < 24 and
                M.month = 12
                and OS.status='FINISHED'
              GROUP BY Y.year, PG.name, DI.discount
              ORDER BY Y.year, DI.discount
            
Source: Ulf Leser, Data Warehouses course
Large relationships (millions of orders, sessions), numerous joins ⇒ potentially difficult query.

Need for Historical and Aggregated Data

  • Business requirements for trend analysis
  • Comparing current performance with historical data
  • Aggregations for different time periods (daily, monthly, yearly)


Support for Complex Queries and Reporting

  • Ad-hoc querying capabilities
  • Handling multi-dimensional analysis
  • Rapid response times for large datasets
  • Supporting various reporting tools and dashboards

Improved Decision-Making and Business Intelligence

  • Providing a single, consistent view of business data
  • Enabling data-driven decision making
  • Supporting predictive analytics and forecasting


Data Consolidation and Single Version of Truth

  • Integrating data from multiple sources
  • Resolving data inconsistencies and conflicts
  • Providing a unified view of the organization
  • Ensuring data quality and consistency across the enterprise

Data Warehouse: Definitions

Bill Inmon's Definition

"A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process."
  • Subject-oriented: Organized around major subjects (e.g., customer, product)
  • Integrated: Consistent naming conventions, formats, encoding structures
  • Time-variant: Explicitly contains time dimension
  • Non-volatile: Data is stable and doesn't change once it's in the warehouse

Ralph Kimball's Definition

"A copy of transaction data specifically structured for query and analysis."
  • Key aspects of Kimball's approach:
    • Dimensional modeling
    • Bus architecture
    • Focus on business processes

Key Characteristics of a Data Warehouse

  • Centralized repository
  • Optimized for reading and analysis
  • Contains both detailed and summarized data
  • Supports time series and trend analysis
  • Metadata-driven

Comparison with Operational Databases

Aspect Data Warehouse Operational Database
Purpose Analytics Transactions
Data model Dimensional Normalized
Data freshness Periodic updates Real-time
Query complexity Complex, unpredictable Simple, predictable
User base Analysts, executives Clerks, customers

OLTP vs OLAP

OLTP (Online Transaction Processing)

  • Characteristics:
    • Handles day-to-day transactions
    • Short, simple transactions
    • High concurrency
    • Predictable, repetitive queries
  • Use cases:
    • Banking transactions
    • Airline reservations
    • Order processing

Database Design for OLTP

  • Normalized data model (3NF)
  • Optimized for write operations
  • Index design for quick lookups
  • Focus on data integrity and consistency

OLAP (Online Analytical Processing)

  • Characteristics:
    • Supports complex analytical queries
    • Aggregations and summarizations
    • Lower concurrency, longer-running queries
    • Historical and predictive analysis
  • Use cases:
    • Sales analysis
    • Financial reporting
    • Customer segmentation

Multidimensional Data Model

  • Dimensions: Descriptive attributes (e.g., time, product, location)
  • Measures: Numerical values for analysis
  • Cube structure: Allows quick slicing and dicing of data
...
OLAP cube slicing

Comparison of OLTP and OLAP

Aspect OLTP OLAP
Workload Many short, atomic transactions Few complex queries
Data model Highly normalized Typically denormalized (star or snowflake schema)
User types Clerks, customers, automated processes Knowledge workers, business analysts, executives
Records accessed Tens Millions

In-class Exercise: OLTP vs OLAP

Exercise Instructions

  • Students will be given a list of business scenarios
  • For each scenario, identify whether it's better suited for OLTP or OLAP
  • Justify your choice based on the characteristics we've discussed
  • We'll discuss the answers as a class

Scenario Examples

  1. Processing customer orders on an e-commerce website
  2. Analyzing customer buying patterns over the last 5 years
  3. Updating inventory levels after each sale
  4. Generating a report on the top-selling products by region
  5. Recording patient visits in a hospital
  6. Predicting future sales based on historical data and market trends
link

DW Industrial Landscape

Major Players in the DW Market

  • Traditional vendors:
    • Oracle
    • IBM (Db2)
    • Microsoft (SQL Server)
    • Teradata
  • Cloud-native solutions:
    • Amazon Redshift
    • Google BigQuery
    • Snowflake
  • Open-source options:
    • Apache Hive
    • Presto

Cloud Data Warehouse Solutions

  • Benefits of cloud data warehouses:
    • Scalability
    • Cost-effectiveness
    • Managed services
    • Integration with cloud ecosystems
  • Comparison of leading cloud DW solutions
  • Hybrid and multi-cloud strategies

Open-source Data Warehouse Tools

  • Apache Hadoop ecosystem:
    • Hive, HBase, Impala
  • Analytical databases:
    • ClickHouse, Apache Druid
  • ETL and data integration:
    • Apache NiFi, Talend Open Studio
  • Visualization and BI:
    • Apache Superset, Metabase

Emerging Trends in Data Warehousing

  • Real-time data warehousing:
    • Streaming data integration
    • Real-time analytics
  • Data lake integration:
    • Data lakehouse concept
    • Unified analytics on structured and unstructured data
  • Machine learning and AI integration:
    • Automated data preparation
    • Predictive analytics within the data warehouse
    • Natural language querying

Conclusion and Preview

Recap of Key Points

  • Challenges driving the need for data warehouses
  • Evolution of database technologies
  • Fundamental principles of DBMS
  • Distinctions between OLTP and OLAP systems
  • Current landscape and trends in data warehousing

Preview of Next Week's Topics

  • Data warehouse architectures
  • Dimensional modeling
  • Introduction to ETL processes

Q&A

If you have any questions or comments, please do not hesitate to contact me:

pierre[dash]henri[dot]paris[at]universite[dash]paris[dash]saclay[dot]fr