Skip to main content

Data Modeling for BI

Effective data modeling creates a foundation for reliable and consistent business reporting. Learn how proper data organization improves decision-making, ensures data quality, and makes analytics more accessible to business users.

Dimensional Modeling Fundamentals

Dimensional Model Overview

Core components of dimensional models and their relationships

100%
🔍 Use Ctrl+Scroll to zoom
ContainsContainsContainsReferencesUsesUsesFactTablesDimensionTablesMeasuresAttributesHierarchiesSurrogateKeys

Legend

Components
Fact
Dimension
Key
Connection Types
Process Flow
Contains
References
Uses

Key Dimensional Modeling Concepts

📊

Facts vs. Dimensions

Technical

The distinction between facts and dimensions is central to business analytics:

  • Facts represent business processes and events that can be measured
  • Dimensions provide context and filtering capabilities for analysis
  • Proper modeling allows flexible "slicing and dicing" of metrics
  • This approach aligns with how business users naturally think about data
  • Enables both high-level summaries and detailed drill-downs
📊

Facts vs. Dimensions

Non-Technical

Technical implementation details for facts and dimensions:

  • Fact tables contain foreign keys to dimensions and numerical measures
  • Facts are typically normalized and partitioned by date
  • Dimension tables contain descriptive attributes and hierarchies
  • Dimensions often use surrogate keys generated during ETL
  • Fact grain determines the level of detail captured in analysis

Star Schema Design

Star Schema Architecture

The structure of a star schema with a central fact table and related dimension tables

100%
🔍 Use Ctrl+Scroll to zoom
date_keyproduct_keycustomer_keystore_keypromo_keySalesFactDateDimensionProductDimensionCustomerDimensionStoreDimensionPromotionDimension

Legend

Components
Fact
Dimension
Connection Types
Process Flow
date_key
product_key
customer_key

Example Star Schema Design

-- Dimension: Date
CREATE TABLE dim_date (
date_key INT PRIMARY KEY,
full_date DATE UNIQUE,
day_of_week VARCHAR(10),
day_of_month INT,
month_number INT,
month_name VARCHAR(10),
quarter INT,
year INT,
is_weekend BOOLEAN,
is_holiday BOOLEAN,
fiscal_year INT,
fiscal_quarter INT
);

-- Dimension: Product
CREATE TABLE dim_product (
product_key INT PRIMARY KEY,
product_id VARCHAR(50) UNIQUE, -- natural key
product_name VARCHAR(100),
product_description TEXT,
brand VARCHAR(50),
category VARCHAR(50),
subcategory VARCHAR(50),
department VARCHAR(50),
unit_cost DECIMAL(10,2),
unit_price DECIMAL(10,2),
current_flag BOOLEAN,
effective_date DATE,
expiration_date DATE
);

-- Dimension: Customer
CREATE TABLE dim_customer (
customer_key INT PRIMARY KEY,
customer_id VARCHAR(50) UNIQUE, -- natural key
customer_name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
address_line1 VARCHAR(100),
address_line2 VARCHAR(100),
city VARCHAR(50),
state VARCHAR(50),
postal_code VARCHAR(20),
country VARCHAR(50),
customer_segment VARCHAR(50),
acquisition_date DATE,
lifetime_value DECIMAL(10,2),
current_flag BOOLEAN,
effective_date DATE,
expiration_date DATE
);

-- Fact: Sales
CREATE TABLE fact_sales (
sales_key BIGINT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
customer_key INT REFERENCES dim_customer(customer_key),
store_key INT REFERENCES dim_store(store_key),
promotion_key INT REFERENCES dim_promotion(promotion_key),
transaction_id VARCHAR(50),
sales_amount DECIMAL(12,2),
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
net_amount DECIMAL(12,2),
tax_amount DECIMAL(10,2),
gross_profit DECIMAL(12,2),
transaction_timestamp TIMESTAMP,
return_flag BOOLEAN
);
❄️

Star vs. Snowflake Schema

Technical

The choice between star and snowflake has business implications:

  • Star schemas provide faster reports and dashboards for end users
  • Snowflake schemas offer better data consistency for reference data
  • Star schemas are more intuitive for business users creating reports
  • Snowflake schemas may be better for managing hierarchical dimensions
  • Most organizations prioritize query performance over storage efficiency
❄️

Star vs. Snowflake Schema

Non-Technical

Technical comparison of star and snowflake schemas:

  • Star schemas denormalize dimensions for simpler queries and faster joins
  • Snowflake schemas normalize dimensions to reduce redundancy
  • Star schemas typically have fewer joins and better query performance
  • Snowflake schemas save storage space but add query complexity
  • Star schemas are preferred in most modern data warehouses

Slowly Changing Dimensions

🔄

Slowly Changing Dimension Types

Technical

SCD strategies allow businesses to manage changing reference data:

  • Track customer attribute changes for accurate point-in-time reporting
  • Maintain historical product categorizations for trend analysis
  • Support compliance requirements by preserving data lineage
  • Enable accurate historical comparisons despite master data changes
  • Balance history preservation with performance and storage needs
🔄

Slowly Changing Dimension Types

Non-Technical

Technical implementation of SCD types:

  • Type 1: Overwrite the attribute value (no history)
  • Type 2: Add a new row with effective dates and current flag
  • Type 3: Add new columns for previous values (limited history)
  • Type 4: Use history tables for full change tracking
  • Type 6: Hybrid approach combining Types 1, 2, and 3

SCD Type 2 Implementation Example

-- SCD Type 2 dimension with history tracking
CREATE TABLE dim_customer (
customer_key SERIAL PRIMARY KEY, -- surrogate key
customer_id VARCHAR(50), -- natural key (business key)
first_name VARCHAR(50),
last_name VARCHAR(50),
email VARCHAR(100),
phone VARCHAR(20),
address VARCHAR(100),
city VARCHAR(50),
state VARCHAR(50),
postal_code VARCHAR(20),
segment VARCHAR(50),
effective_date DATE NOT NULL, -- when this version became effective
expiration_date DATE, -- when this version expired (null if current)
current_flag BOOLEAN NOT NULL, -- indicates if this is current version

-- Create an index on the combination of natural key and current flag
-- for efficient lookups during ETL
CONSTRAINT idx_customer_current UNIQUE (customer_id, current_flag)
WHERE current_flag = TRUE
);

-- Query to handle a customer address change in Type 2 SCD
BEGIN TRANSACTION;

-- Step 1: Set expiration and current flag on existing record
UPDATE dim_customer
SET
expiration_date = CURRENT_DATE - 1,
current_flag = FALSE
WHERE
customer_id = '12345'
AND current_flag = TRUE;

-- Step 2: Insert new record with updated address
INSERT INTO dim_customer (
customer_id, first_name, last_name, email, phone,
address, city, state, postal_code, segment,
effective_date, expiration_date, current_flag
)
SELECT
customer_id, first_name, last_name, email, phone,
'New Address', city, state, postal_code, segment, -- Only address changes
CURRENT_DATE, NULL, TRUE
FROM dim_customer
WHERE
customer_id = '12345'
AND expiration_date = CURRENT_DATE - 1;

COMMIT;

Aggregate Tables and Materialized Views

📈

Aggregation Strategies

Technical

Aggregation strategies deliver significant business benefits:

  • Dashboard performance improvements of 10-100x for large datasets
  • Reduced operational costs through efficient resource utilization
  • Support for interactive data exploration at scale
  • Enablement of mobile BI with low latency requirements
  • Cost-effective scaling of analytics to more business users
📈

Aggregation Strategies

Non-Technical

Technical implementation of aggregation strategies:

  • Precalculated aggregate tables with dimension foreign keys
  • Materialized views with incremental refresh mechanisms
  • Summary tables with different temporal granularities
  • Aggregate navigation in semantic layers
  • Indexing strategies for aggregate tables

Materialized Aggregate Example

-- Create a daily sales aggregate table
CREATE TABLE agg_daily_sales (
date_key INT REFERENCES dim_date(date_key),
product_key INT REFERENCES dim_product(product_key),
store_key INT REFERENCES dim_store(store_key),

-- Aggregate measures
total_sales_amount DECIMAL(14,2),
total_quantity INT,
total_discount DECIMAL(12,2),
transaction_count INT,
customer_count INT,

-- Meta information
last_updated_timestamp TIMESTAMP,

-- Primary key
PRIMARY KEY (date_key, product_key, store_key)
);

-- Population/refresh logic
INSERT INTO agg_daily_sales
SELECT
f.date_key,
f.product_key,
f.store_key,
SUM(f.sales_amount) AS total_sales_amount,
SUM(f.quantity) AS total_quantity,
SUM(f.discount_amount) AS total_discount,
COUNT(DISTINCT f.transaction_id) AS transaction_count,
COUNT(DISTINCT f.customer_key) AS customer_count,
CURRENT_TIMESTAMP AS last_updated_timestamp
FROM
fact_sales f
WHERE
f.date_key >= (SELECT MAX(date_key) FROM agg_daily_sales) - 7 -- Incremental refresh
GROUP BY
f.date_key,
f.product_key,
f.store_key
ON CONFLICT (date_key, product_key, store_key)
DO UPDATE SET
total_sales_amount = EXCLUDED.total_sales_amount,
total_quantity = EXCLUDED.total_quantity,
total_discount = EXCLUDED.total_discount,
transaction_count = EXCLUDED.transaction_count,
customer_count = EXCLUDED.customer_count,
last_updated_timestamp = CURRENT_TIMESTAMP;

Conformed Dimensions and Facts

Enterprise Data Warehouse Architecture

Structure of a data warehouse with shared conformed dimensions across multiple fact tables

100%
🔍 Use Ctrl+Scroll to zoom
date_keyproduct_keycustomer_keystore_keydate_keyproduct_keystore_keydate_keycustomer_keySalesFactInventoryFactMarketingFactDateDimensionProductDimensionCustomerDimensionStoreDimension

Legend

Components
Fact
Dimension
Connection Types
Process Flow
date_key
product_key
customer_key
🔄

Conformed Dimensions

Technical

Conformed dimensions deliver essential business capabilities:

  • Ensures consistent reporting across different business processes
  • Enables integrated dashboards that combine metrics from multiple areas
  • Facilitates drill-across capabilities between related subject areas
  • Creates a shared business vocabulary across the organization
  • Supports enterprise-wide analytics governance
🔄

Conformed Dimensions

Non-Technical

Technical implementation of conformed dimensions:

  • Shared dimension tables referenced by multiple fact tables
  • Consistent surrogate key generation across data pipelines
  • Master data management integration for consistency
  • Attribute standardization and centralized business rules
  • Dimension versioning coordination across subject areas

Data Vault Modeling

🏛️

Data Vault Architecture

Technical

Data Vault modeling addresses key business challenges:

  • Adaptability to changing business requirements without model redesign
  • Auditability for regulatory compliance with full data lineage
  • Integration of disparate data sources while preserving source information
  • Scalability for enterprise data volumes with parallel loading
  • Historical preservation across the enterprise data landscape
🏛️

Data Vault Architecture

Non-Technical

Technical implementation of Data Vault modeling:

  • Hub tables that store business keys with minimal context
  • Link tables that maintain relationships between business keys
  • Satellite tables that store descriptive attributes and history
  • Hash keys for performance and referential integrity
  • Layered architecture with raw, business, and presentation vaults

Data Vault Core Components

-- Hub: Customer (business entity)
CREATE TABLE hub_customer (
customer_hk CHAR(32) PRIMARY KEY, -- hash key of business key
customer_bk VARCHAR(50) NOT NULL, -- business key
record_source VARCHAR(100) NOT NULL,
load_date TIMESTAMP NOT NULL,
UNIQUE(customer_bk)
);

-- Link: Customer to Account relationship
CREATE TABLE link_customer_account (
link_hk CHAR(32) PRIMARY KEY, -- hash of relationship
customer_hk CHAR(32) NOT NULL REFERENCES hub_customer(customer_hk),
account_hk CHAR(32) NOT NULL REFERENCES hub_account(account_hk),
record_source VARCHAR(100) NOT NULL,
load_date TIMESTAMP NOT NULL,
UNIQUE(customer_hk, account_hk)
);

-- Satellite: Customer details
CREATE TABLE sat_customer_details (
customer_hk CHAR(32) NOT NULL REFERENCES hub_customer(customer_hk),
hashdiff CHAR(32) NOT NULL, -- hash of all attributes
effective_from TIMESTAMP NOT NULL,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL,

-- Descriptive attributes
customer_name VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
address VARCHAR(200),
city VARCHAR(50),
state VARCHAR(50),
postal_code VARCHAR(20),

-- Audit/metadata attributes
is_current BOOLEAN,
expiry_date TIMESTAMP,

PRIMARY KEY (customer_hk, effective_from)
);

Best Practices for BI Data Modeling

Design Principles

  1. Prioritize Query Performance

    • Denormalize dimensions where appropriate
    • Create indexes aligned with common query patterns
    • Implement aggregates for frequently used metrics
  2. Plan for Scalability

    • Design partitioning strategies aligned with data volumes
    • Implement incremental processing patterns
    • Consider distribution keys in MPP databases
  3. Balance Flexibility and Standardization

    • Create conformed dimensions for cross-subject analysis
    • Document business rules and calculations
    • Establish naming conventions and metadata standards
  4. Design for Understandability

    • Use business terminology in model naming
    • Document metrics and dimension definitions
    • Create intuitive hierarchies for drill-down analysis
  5. Implement Proper History Management

    • Choose appropriate SCD types based on business requirements
    • Document time variance policies for dimensions
    • Consider bi-temporal modeling for regulated industries

Additional Resources