Introduction

Challenges for Databases

Data Integration Challenges

Multiple data sources with different formats
Data silos in organizations
Ensuring data consistency across sources

A multinational company integrating sales data from different countries

Multiple data sources with different formats:
- Challenge: Organizations often have data spread across various systems (e.g., CRM, ERP, HR systems) in different formats (databases, spreadsheets, text files).
- Impact: Makes it difficult to get a complete view of the business or perform cross-functional analysis.
- Solution approach: ETL (Extract, Transform, Load) processes are used to consolidate and standardize data.
Data silos in organizations:
- Definition: Isolated pockets of data that are not easily accessible by other parts of the organization.
- Causes: Often result from departmental systems, acquisitions, or legacy technology.
- Problem: Leads to incomplete information for decision-making and potential data inconsistencies.
- Solution: Data warehouses aim to break down these silos by integrating data from across the organization.
Ensuring data consistency across sources:
- Challenge: Different systems may represent the same data differently (e.g., date formats, customer IDs).
- Importance: Inconsistent data can lead to incorrect analysis and poor decision-making.
- Approach: Data governance policies and data quality processes are crucial in maintaining consistency.
Real-world example: A multinational company integrating sales data from different countries faces challenges like:
- Different currencies and exchange rate fluctuations
- Varying fiscal years and reporting standards
- Multiple languages and cultural differences in data entry
- Diverse local systems and data formats
Why this matters:
- Effective data integration is foundational for accurate analytics and reporting.
- It enables a holistic view of the organization, supporting better strategic decisions.
- Addressing these challenges is a key part of building a successful data warehouse.
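
The consistency problem described above (the same customer and date represented differently in two systems) can be sketched in a few lines. This is only an illustration under stated assumptions: the two source systems, their field names, and their formats (`CUST-0042` versus `42`, US-style versus ISO dates) are made up for the example.

```python
from datetime import datetime

# Hypothetical extracts from two systems that describe the same customer.
crm_rows = [{"customer_id": "CUST-0042", "signup": "03/15/2023"}]  # US-style date
erp_rows = [{"cust_no": 42, "signup_date": "2023-03-15"}]          # ISO date


def standardize_crm(row):
    """Map a CRM row onto the shared format: integer ID, ISO date string."""
    return {
        "customer_id": int(row["customer_id"].split("-")[1]),
        "signup_date": datetime.strptime(row["signup"], "%m/%d/%Y").date().isoformat(),
    }


def standardize_erp(row):
    """Map an ERP row onto the same shared format."""
    return {
        "customer_id": int(row["cust_no"]),
        "signup_date": datetime.strptime(row["signup_date"], "%Y-%m-%d").date().isoformat(),
    }


# After standardization, rows from both systems can be compared or merged.
unified = [standardize_crm(r) for r in crm_rows] + [standardize_erp(r) for r in erp_rows]
```

Once both rows are in the shared format, they compare as equal, which is what makes cross-system matching possible.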

Think about a company you're familiar with (perhaps from a previous job or internship). What different types of data systems might they use, and what challenges might they face in integrating all their data?

1960s - Early Database Systems

Hierarchical Databases
- IBM's Information Management System (IMS)
- Tree-like structure
- Limitations: Inflexibility, data redundancy
Network Databases
- Integrated Data Store (IDS)
- Based on the CODASYL model
- Improvement over hierarchical, but still complex

Context:
- 1960s: Computers were just beginning to be used for business data processing.
- Challenge: How to efficiently store and retrieve large amounts of data.
Hierarchical Databases:
- Structure: Tree-like, parent-child relationships.
- Example: IBM's Information Management System (IMS)
- How it works: Data is organized in a hierarchy, like an organizational chart.
- Strengths: Efficient for certain types of relationships (e.g., parts in assemblies).
- Limitations:
  - Difficulty representing many-to-many relationships.
  - Inflexible for changing business needs.
  - Complex querying for data not following the hierarchy.
Network Databases:
- Structure: Based on the CODASYL model, allowing more complex relationships.
- Example: Integrated Data Store (IDS), developed at General Electric, whose design strongly influenced the CODASYL specification.
- Improvement over hierarchical: Could represent more complex relationships.
- How it works: Uses records and sets to create network-like connections between data.
- Limitations:
  - Still complex to manage and query.
  - Required detailed knowledge of the database structure to navigate data.
Legacy Impact:
- Some of these systems (especially IMS) are still used today in certain industries (banking, insurance) due to their reliability and the cost of migration.
Historical Significance:
- These early systems laid the groundwork for future database development.
- Their limitations directly influenced the development of the relational model in the 1970s.
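
The hierarchical model's strengths and limitations can be illustrated with a nested structure. This is a rough sketch, not how IMS actually stores data; the departments and employee names are hypothetical.

```python
# A hierarchical "database": each child record lives under exactly one parent.
company = {
    "Engineering": {"employees": ["Ana", "Ben"]},
    "Sales": {"employees": ["Cara"]},
}


def employees_in(department):
    """Follows the hierarchy: one direct step from parent to children."""
    return company[department]["employees"]


def department_of(employee):
    """Goes against the hierarchy: every branch has to be scanned."""
    for department, record in company.items():
        if employee in record["employees"]:
            return department
    return None
```

Questions that follow the tree (who works in Engineering?) are a single lookup, while questions that cut across it (where does Cara work?) require a full traversal. A many-to-many case, such as Ben working in both departments, would force his record to be duplicated under each parent.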

Study Tip: Try to visualize how you would structure a simple dataset (e.g., employees in a company) using a hierarchical model. Then, think about what kinds of questions would be easy or difficult to answer with this structure. This exercise can help you understand why these models were eventually superseded.

1970s - Relational Databases and SQL

Introduction of the relational model by E.F. Codd (1970)
Key concepts: Tables, rows, columns, keys
Development of SQL (Structured Query Language)
First commercial RDBMS: Oracle (1979)

The Relational Model:
- Introduced by E.F. Codd in 1970.
- Core idea: Represent data in tables with rows and columns, using relationships between these tables.
- Revolutionary because: It provided a more flexible and intuitive way to structure and query data.
Key Concepts:
- Tables (Relations): Each table represents an entity or concept.
- Rows (Tuples): Each row is a specific instance or record.
- Columns (Attributes): Represent properties or characteristics of the entity.
- Keys: Used to uniquely identify rows and establish relationships between tables.
Advantages over earlier models:
- Flexibility: Easier to modify the database structure as business needs change.
- Ad-hoc querying: Users can ask complex questions without needing to understand the physical data storage.
- Data independence: Changes to the physical storage don't affect the logical view of the data.
SQL (Structured Query Language):
- Developed to interact with relational databases.
- Standardized language for querying and managing relational databases.
- Made databases accessible to a wider range of users, not just specialized programmers.
First Commercial RDBMS:
- Oracle, founded in 1977, released its first commercial SQL-based RDBMS in 1979.
- Other early players: IBM's System R, the research project in which SQL (originally called SEQUEL) was developed.
Impact:
- Relational databases quickly became the dominant model for data management.
- Concepts from the relational model still underpin much of modern data management, including in data warehouses.
Modern relevance:
- Most business applications today use relational databases.
- SQL remains the standard language for database interaction, with various dialects for different systems.
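
The key concepts above (tables, rows, columns, keys) and ad-hoc querying can be tried out with Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical, chosen only to show a primary key, a foreign key, and a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,          -- key: uniquely identifies a row
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- relationship
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (100, 1, 25.0), (101, 1, 15.0), (102, 2, 40.0);
""")

# An ad-hoc question: total order amount per customer.
# The query says what is wanted, not how the data is physically stored.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```

Here `rows` holds one `(name, total)` tuple per customer. The same tables could answer many other questions without any change to their structure.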

Practice designing a simple relational database. For example, try to model a library system with books, authors, and borrowers. Think about how you would structure the tables and relationships. This exercise will help you grasp the fundamental concepts of relational database design.

1980s - Object-Oriented Databases

Designed to handle complex data structures
Integration with object-oriented programming languages
Examples: Versant, ObjectStore
Limited adoption compared to relational databases

Object-Oriented Databases (OODBs) - 1980s:
1. Purpose:
  - Designed to handle complex data structures that were difficult to represent in relational databases.
  - Aimed to bridge the gap between object-oriented programming and database management.
2. Key features:
  - Data stored as objects, mirroring object-oriented programming concepts.
  - Support for complex data types and relationships.
  - Ability to store and retrieve complete object structures.
3. Integration with object-oriented programming languages:
  - Allowed close interaction between databases and OOP languages such as C++ and Smalltalk, and later Java.
  - Reduced the "impedance mismatch" problem (the difficulty of translating between programming objects and relational database structures).
  - Example: A Java object could be directly stored in and retrieved from the database without needing to be decomposed into tables.
4. Examples of OODBs:
  - Versant: One of the early commercial OODBs, known for its performance.
  - ObjectStore: Another pioneering OODB, focused on scalability and integration with C++.
  - Other notable examples: Objectivity/DB, Db4o (database for objects)
5. Limited adoption compared to relational databases:
  - Despite initial enthusiasm, OODBs didn't become as widespread as expected.
  - Reasons for limited adoption:
    - Maturity and widespread use of relational databases
    - Complexity of object-oriented data modeling
    - Lack of standardization (each OODB had its own way of doing things)
    - Performance issues with complex queries
    - Limited support for ad-hoc querying compared to SQL
6. Legacy and influence:
  - While pure OODBs are not widely used, their concepts influenced:
    - Object-relational databases (e.g., PostgreSQL with its object-relational features)
    - NoSQL databases, particularly document databases
  - Some niche applications in fields like computer-aided design (CAD) and scientific databases still use OODBs
Why this is important to understand:
- Shows the evolution of database technology in response to programming paradigms.
- Illustrates challenges in adopting new database models, even when they have theoretical advantages.
- Helps in understanding the strengths and limitations of different database types.
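
The "impedance mismatch" mentioned above can be made concrete by writing out the translation code that a relational store requires. This is a simplified sketch: the `Order` and `LineItem` classes and the two-table layout are assumptions for illustration, and no real OODB API is shown.

```python
from dataclasses import dataclass, field


@dataclass
class LineItem:
    product: str
    quantity: int


@dataclass
class Order:
    order_id: int
    customer: str
    items: list = field(default_factory=list)


def to_rows(order):
    """Decompose one object graph into flat rows for two hypothetical tables."""
    order_row = (order.order_id, order.customer)
    item_rows = [(order.order_id, i.product, i.quantity) for i in order.items]
    return order_row, item_rows


def from_rows(order_row, item_rows):
    """Reassemble the object from its rows."""
    order_id, customer = order_row
    items = [LineItem(product, qty) for oid, product, qty in item_rows if oid == order_id]
    return Order(order_id, customer, items)


order = Order(1, "Alice", [LineItem("pen", 3), LineItem("notebook", 1)])
order_row, item_rows = to_rows(order)
rebuilt = from_rows(order_row, item_rows)
```

The `to_rows` and `from_rows` functions are exactly the kind of mapping layer that object-oriented databases aimed to make unnecessary by storing the `Order` object as a whole.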

Think about an application you use regularly (e.g., a social media app). Try to imagine how its data might be structured as objects (User objects, Post objects, etc.). Then consider why a relational model might be chosen instead. This exercise will help you grasp the trade-offs between object-oriented and relational data models.

Current Trends

AI/ML integration in databases
- Automated tuning, predictive analytics
Graph databases for complex relationship analysis
Multi-model databases
Blockchain in database management

AI/ML integration in databases:
- Automated tuning: Databases can optimize themselves based on usage patterns.
- Predictive analytics: Embedding machine learning models directly into the database for real-time predictions.
- Why it matters: Reduces the need for manual database administration and enables more sophisticated analysis inside the database.
Graph databases:
- Purpose: Optimized for managing and querying highly interconnected data.
- Use cases: Social network analysis, fraud detection, recommendation engines.
- Examples: Neo4j, Amazon Neptune
- Why it's important: Enables efficient analysis of relationships in data, which is challenging in traditional relational databases.
Multi-model databases:
- Definition: Databases that can store and process multiple data models (e.g., relational, document, graph) in a single system.
- Advantage: Provides flexibility to handle diverse data types and queries within one database system.
- Examples: ArangoDB, OrientDB
- Why it matters: Simplifies data architecture and reduces the need for multiple specialized databases.
Blockchain in database management:
- Application: Using blockchain technology to create tamper-evident audit trails and help ensure data integrity.
- Potential use cases: Financial transactions, supply chain management, healthcare records.
- Current status: Still an emerging area, with more potential than widespread adoption.
- Why it's significant: Could revolutionize how we ensure data integrity and trust in distributed systems.
Overall impact of these trends:
- Databases are becoming more intelligent, flexible, and capable of handling diverse and complex data.
- The lines between traditional databases, data warehouses, and analytics platforms are blurring.
- These advancements are enabling new types of applications and business models.
Challenges:
- Keeping up with rapidly evolving technology
- Ensuring security and privacy with more complex systems
- Managing the increased complexity these advanced features bring
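
To see why graph databases suit relationship-heavy questions, consider a "friends of friends" recommendation of the kind mentioned under use cases. The sketch below uses a plain in-memory adjacency list rather than a real graph database, and the people in it are hypothetical.

```python
# Hypothetical social graph stored as an adjacency list: person -> set of friends.
friends = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "dev"},
    "cara": {"ana", "dev"},
    "dev": {"ben", "cara", "eli"},
    "eli": {"dev"},
}


def friends_of_friends(person):
    """People exactly two hops away who are not already direct friends."""
    direct = friends[person]
    two_hops = set()
    for friend in direct:
        two_hops |= friends[friend]
    return two_hops - direct - {person}
```

Each hop is a direct lookup from a node to its neighbours. In a relational design the same question typically needs a self-join on a friendship table for every additional hop, which is the difficulty the notes refer to.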

For each trend, try to think of a specific application or company that might benefit from it. For example, how might a social media company use graph databases? How could a bank leverage blockchain in its database systems? This exercise will help you connect these trends to real-world scenarios.

Atomicity

  +---------------------+            +---------------------+
  |    Account A        |            |    Account B        |
  |  Balance: $1000     |            |  Balance: $500      |
  |---------------------|            |---------------------|
  |  Debit: $100 (-)    |----------> |  Credit: $100 (+)   |
  +---------------------+            +---------------------+
            |                                      |
            |                                      |
            V                                      V
      Atomicity:                            Atomicity:
  Both actions succeed                Both actions succeed
  or none do.                         or none do.

Consistency

    Total balance remains the same: $1000 + $500 = $1500
    Consistency: Total amount conserved.

Isolation

[Transaction 1]                       [Transaction 2]
+--------------------+                +---------------------+
| Transfer: A to B   |                | Transfer: C to D    |
|--------------------|                |---------------------|
| Isolated from      |                | Isolated from       |
| Transaction 2      |                | Transaction 1       |
+--------------------+                +---------------------+
    Isolation: Transactions don't interfere with each other

Durability

                            CRASH
                            -----
      Durability: Balances are saved, even after failure

  After system restarts:
  +---------------------+            +---------------------+
  |    Account A        |            |    Account B        |
  |  Balance: $900      |            |  Balance: $600      |
  +---------------------+            +---------------------+
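
The transfer in the diagrams above can be reproduced with Python's built-in `sqlite3` module. This is a minimal sketch of atomicity and consistency only: the `accounts` table, the `transfer` helper, and the non-negative balance rule are assumptions for the example, and an in-memory database does not demonstrate durability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"  # assumed rule: no overdrafts
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 1000), ("B", 500)])
conn.commit()


def transfer(conn, source, target, amount):
    """Credit and debit inside one transaction: both succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict of name -> balance."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


ok = transfer(conn, "A", "B", 100)       # valid transfer
failed = transfer(conn, "A", "B", 5000)  # debit would violate the CHECK constraint
final = balances(conn)
```

The second transfer credits B before the debit on A fails, so the rollback has to undo the credit as well. The balances stay at $900 and $600, and the total remains $1500.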

Limitations of Operational Databases for Analytics

Designed for day-to-day transactions, not complex queries
Performance impact of analytical queries on operational systems
Lack of historical data retention
Data scattered across multiple systems

Operational databases are designed for day-to-day transactions, not complex analytics:
- Optimized for quick inserts and updates, not large-scale data retrieval.
- Schema designed for operational efficiency, not analytical queries.
Performance impact of analytical queries on operational systems:
- Complex queries can slow down critical business operations.
- Example: Running a year-end sales analysis could impact the system's ability to process new orders.
Lack of historical data retention:
- Operational systems often only keep current or recent data.
- Example: If you want to analyze sales trends over the past 5 years, but your system only keeps the last 6 months of data, you can't perform the analysis.
Data scattered across multiple systems:
- In many organizations, relevant data is spread across various operational systems.
- Example: Customer information might be in a CRM system, their purchase history in an ERP system, and their support tickets in a helpdesk system.
Why these limitations led to data warehouses:
- Data warehouses are designed to address these specific challenges.
- They provide a centralized repository optimized for analytical queries.
- Allow integration of data from multiple sources into a consistent format.
- Designed to store and manage historical data effectively.
Impact on business:
- Without addressing these limitations, businesses struggle to gain comprehensive insights from their data.
- Data-driven decision making becomes challenging and time-consuming.
- Competitive advantage can be lost to more data-savvy competitors.

Think about a business you're familiar with. What kinds of analytical questions might they want to ask that would be difficult with just operational databases? How might a data warehouse help them answer these questions more effectively?

Support for Complex Queries and Reporting

Ad-hoc querying capabilities
Handling multi-dimensional analysis
Rapid response times for large datasets
Supporting various reporting tools and dashboards

Ad-hoc querying capabilities:
- Definition: Ability for users to create custom, on-the-fly queries without predefined templates.
- Why it's important: Allows business users to explore data freely, answering new questions as they arise.
- Example: A marketing manager wanting to quickly analyze the effectiveness of a campaign across different customer segments and regions.
Handling multi-dimensional analysis:
- Definition: Analyzing data across multiple dimensions simultaneously (e.g., time, geography, product).
- How it works: Data is structured in a way that allows quick "slicing and dicing" across dimensions.
- Example: Analyzing sales by product category, region, time period, and customer demographic all at once.
- Why it matters: Provides a comprehensive view of business performance and allows for deep, nuanced analysis.
Rapid response times for large datasets:
- How it's achieved: Through specific design choices like denormalization, pre-aggregation, and specialized indexing.
- Impact: Queries that might take hours on an operational system could return results in seconds in a well-designed data warehouse.
- Why it's crucial: Enables interactive analysis and rapid decision-making based on large volumes of data.
Supporting various reporting tools and dashboards:
- Types of tools: Business Intelligence (BI) software, data visualization tools, custom reporting applications.
- Examples: Tableau, Power BI, Looker, QlikView.
- Benefits:
  - Provides user-friendly interfaces for non-technical users to access and analyze data.
  - Enables creation of dynamic, interactive dashboards for monitoring key business metrics.
  - Allows for scheduled report generation and distribution.
Real-world applications:
- Financial analysis: Quickly assessing profitability across multiple product lines and regions.
- Customer segmentation: Identifying high-value customer groups based on various attributes and behaviors.
- Supply chain optimization: Analyzing inventory levels, supplier performance, and demand patterns across the entire supply network.
Skills needed:
- SQL for complex querying
- Understanding of dimensional modeling concepts
- Familiarity with BI and data visualization tools
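
Multi-dimensional "slicing and dicing" amounts to aggregating the same measure along different combinations of dimensions. The sketch below uses `sqlite3` with a small hypothetical `sales` table; the `totals_by` helper is an illustration, not a feature of any BI tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, category TEXT, year INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("EU", "Books", 2023, 100.0),
    ("EU", "Books", 2024, 120.0),
    ("EU", "Games", 2024, 80.0),
    ("US", "Books", 2024, 200.0),
    ("US", "Games", 2023, 50.0),
])

DIMENSIONS = {"region", "category", "year"}


def totals_by(*dims):
    """Total sales amount along any combination of the known dimensions."""
    if not dims or set(dims) - DIMENSIONS:
        raise ValueError(f"expected one or more of {sorted(DIMENSIONS)}")
    cols = ", ".join(dims)  # safe to interpolate: names were checked against DIMENSIONS
    query = f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"
    return conn.execute(query).fetchall()
```

Calling `totals_by("region")` gives one total per region, while `totals_by("category", "year")` breaks the same measure down by two dimensions at once. A warehouse makes such queries fast on much larger tables through the design choices listed above.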

If possible, get hands-on experience with a BI tool like Tableau Public (free version available). Try connecting to a sample dataset and creating some multi-dimensional visualizations. This practical experience will help you understand the power of these analytical capabilities.

Bill Inmon's Definition

"A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management's decision-making process."

Subject-oriented: Organized around major subjects (e.g., customer, product)
Integrated: Consistent naming conventions, formats, encoding structures
Time-variant: Explicitly contains time dimension
Non-volatile: Data is stable and doesn't change once it's in the warehouse

Bill Inmon is often called the "father of data warehousing."

Subject-oriented:
- Meaning: Data is organized around major subjects of the enterprise (e.g., customers, products, sales).
- Contrast with operational systems: These are often organized around specific applications or processes.
- Example: Instead of having separate data for the order system, inventory system, and customer service system, a data warehouse would organize all relevant data around the concept of "sales."
- Why it matters: Provides a business-centric view of data, making it easier for analysts to work with.
Integrated:
- Meaning: Data from different sources is merged into a consistent format.
- Challenges addressed: Resolves differences in naming conventions, encoding structures, attribute measures, etc.
- Example: Combining data where one system uses "Gender" (M/F) and another uses "Sex" (0/1) into a standardized format.
- Why it matters: Ensures consistency and reliability in reporting and analysis across the entire organization.
Time-variant:
- Meaning: The data warehouse keeps historical data, not just current data.
- How it's implemented: Often includes a time dimension in its structure.
- Example: Storing multiple versions of a product price over time, not just the current price.
- Why it matters: Enables trend analysis, year-over-year comparisons, and other time-based analytics.
Non-volatile:
- Meaning: Once data enters the warehouse, it doesn't change.
- How it works: Data is typically loaded in regular batches and is not continuously updated like in operational systems.
- Example: Yesterday's sales figures, once loaded into the warehouse, remain constant.
- Why it matters: Ensures consistent reporting results and provides a stable environment for complex queries.
In support of management's decision-making process:
- Overall purpose: To provide reliable, comprehensive data for strategic decision-making.
- Types of decisions supported: Long-term strategic planning, performance evaluation, trend analysis.
Inmon's approach (also known as Corporate Information Factory):
- Advocates for a top-down design approach.
- Emphasizes a centralized data warehouse that feeds departmental data marts.

Try to think of examples for each characteristic from a business you're familiar with. How might their data be subject-oriented? What kinds of historical data might they need to keep? This exercise will help you understand how these concepts apply in real-world scenarios.

Ralph Kimball's Definition

"A copy of transaction data specifically structured for query and analysis."

Key aspects of Kimball's approach:
- Dimensional modeling
- Bus architecture
- Focus on business processes

Interpretation of the definition:
- "Copy of transaction data": Implies that the data warehouse doesn't replace operational systems but replicates their data.
- "Specifically structured": The data is reorganized and optimized for analytical purposes.
- "For query and analysis": The primary goal is to support business intelligence and decision-making processes.
Key aspects of Kimball's approach:
1. Dimensional modeling:
  - Definition: A technique for structuring data in a way that's intuitive for business users and optimized for query performance.
  - Key components:
    - Fact tables: Contain quantitative metrics of business processes (e.g., sales amounts, quantities).
    - Dimension tables: Contain descriptive attributes (e.g., product details, customer information, time).
  - Benefits:
    - Simplifies complex queries
    - Improves query performance
    - Makes data more understandable to business users
  - Example: A sales fact table might have foreign keys to dimensions like Date, Product, Customer, and Store.
2. Bus architecture:
  - Definition: A design approach that uses standardized dimensions across different business processes.
  - How it works:
    - Identifies key business processes (e.g., sales, orders, inventory)
    - Defines conformed dimensions that can be used across these processes
  - Benefits:
    - Enables integration of data marts across the enterprise
    - Ensures consistency in reporting across different business areas
  - Example: A "Customer" dimension used consistently across sales, support, and marketing data marts.
3. Focus on business processes:
  - Approach: Organizes the data warehouse around core business processes rather than departments.
  - Why it matters:
    - Aligns the data warehouse with how the business actually operates
    - Facilitates end-to-end analysis of business processes
    - Makes the data warehouse more adaptable to organizational changes
  - Example: Focusing on an "Order to Cash" process rather than separate "Sales" and "Finance" data marts.
Kimball vs. Inmon approach:
- Kimball advocates a bottom-up approach, starting with individual data marts.
- Inmon prefers a top-down approach with a centralized data warehouse.
- Kimball's approach often allows for faster implementation and more flexibility.
Impact on data warehouse design:
- Emphasis on creating a user-friendly, business-oriented data structure.
- Use of star schemas or snowflake schemas in database design.
- Development of conformed dimensions for enterprise-wide consistency.
Skills needed to implement Kimball's approach:
- Understanding of business processes and metrics
- Proficiency in dimensional modeling techniques
- Ability to design and implement star schemas
- Knowledge of ETL processes to populate dimensional models
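
A very small star schema, with one fact table and two dimension tables, might look like the following sketch. It uses `sqlite3` for convenience; the table names (`fact_sales`, `dim_date`, `dim_product`), the integer date keys, and the sample rows are assumptions for illustration rather than a prescribed Kimball design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact table: numeric measures plus foreign keys to each dimension.
    CREATE TABLE fact_sales (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );

    INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240210, 2024, 2);
    INSERT INTO dim_product VALUES (1, 'Pen', 'Stationery'), (2, 'Lamp', 'Home');
    INSERT INTO fact_sales VALUES
        (20240115, 1, 10, 15.0),
        (20240115, 2, 1, 30.0),
        (20240210, 1, 4, 6.0);
""")

# A typical dimensional query: join the fact table to its dimensions,
# then group by descriptive attributes.
sales_by_category_month = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()
```

Every analytical query follows the same shape: start from the fact table, join outward to the dimensions, and group by whichever attributes the business question needs.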

Try to design a simple star schema for a business process you're familiar with (e.g., sales, library book checkouts). Identify what would be in the fact table and what dimensions you'd need. This exercise will help you grasp the practical application of Kimball's dimensional modeling concept.

Aspect	Data Warehouse	Operational Database
Purpose	Analytics	Transactions
Data model	Dimensional	Normalized
Data freshness	Periodic updates	Real-time
Query complexity	Complex, unpredictable	Simple, predictable
User base	Analysts, executives	Clerks, customers

Aspect	OLTP	OLAP
Workload	Many short, atomic transactions	Few complex queries
Data model	Highly normalized	Typically denormalized (star or snowflake schema)
User types	Clerks, customers, automated processes	Knowledge workers, business analysts, executives
Records accessed	Tens	Millions

Preview of Next Week's Topics

Data warehouse architectures
Dimensional modeling
Introduction to ETL processes

A typical data warehouse architecture follows a multi-layered approach. The source layer contains the raw data from various transactional systems. The ETL (Extract, Transform, Load) layer processes this data by extracting it from multiple sources, transforming it into a consistent format, and loading it into the warehouse. The data storage layer is where the processed data resides—usually in a dimensional format to support analytical queries. Finally, the presentation layer is what users interact with through reporting tools, dashboards, or direct queries. This architecture supports the separation of concerns: operational systems handle transactions, while the warehouse manages analytics.
Dimensional modeling is a design technique used in data warehouses that focuses on simplifying complex queries. It organizes data into "facts" and "dimensions." Facts are quantitative data points (like sales amounts or quantities) while dimensions are descriptive data that provide context (like time, location, or product categories). A common representation of this model is the star schema, where facts are at the center of the "star," and dimensions surround it. The snowflake schema is a more normalized version, where dimensions themselves are further broken down into related sub-dimensions. This model helps optimize queries for fast retrieval and is especially useful for reporting and analytical tasks.
ETL (Extract, Transform, Load) processes are critical to data warehousing. Data is first extracted from multiple, often heterogeneous sources, such as databases, flat files, or APIs. It is then transformed, which can involve cleaning, normalizing, and aggregating records. For example, sales data from different countries might need to be converted to a standard currency or date format. The transformed data is then loaded into the data warehouse for analysis. ETL processes often run as batch jobs, typically during off-hours to reduce the load on operational systems. An alternative is ELT (Extract, Load, Transform), where raw data is loaded first and then transformed inside the warehouse; this pattern is common in modern cloud data warehouses and in pipelines that need fresher data.
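
The three ETL steps can be sketched end to end with the standard library. Everything specific here is an assumption for illustration: the CSV layout, the per-country date formats, and especially the fixed exchange rates, which a real job would look up for each sale date.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical extract: a CSV export combining two regional sales systems.
SOURCE_CSV = """country,sale_date,amount,currency
DE,15.03.2024,100.00,EUR
US,03/16/2024,50.00,USD
"""

# Assumed fixed rates to USD, for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.1}
DATE_FORMATS = {"DE": "%d.%m.%Y", "US": "%m/%d/%Y"}


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: standardize dates to ISO format and amounts to USD."""
    out = []
    for row in rows:
        day = datetime.strptime(row["sale_date"], DATE_FORMATS[row["country"]]).date()
        usd = round(float(row["amount"]) * RATES_TO_USD[row["currency"]], 2)
        out.append((row["country"], day.isoformat(), usd))
    return out


def load(conn, rows):
    """Load: write the standardized rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (country TEXT, sale_date TEXT, amount_usd REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute("SELECT * FROM sales ORDER BY sale_date").fetchall()
```

Keeping the three steps as separate functions mirrors the layered description above. In an ELT variant, `load` would run on the raw rows first and the standardization would be done afterwards with queries inside the warehouse.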

Data warehouse I

Week 1