Star Schema Design
How the flat Crunchbase CSV decomposes into a dimensional warehouse
A star schema organizes data into a central fact table (one row per startup, containing all numeric measures) surrounded by dimension tables that describe the "who, what, when, where" context. This design enables fast aggregation across any combination of dimensions.
dim_market
736 rows
market_id (PK)
market_name
dim_country
113 rows
country_id (PK)
country_code
dim_time
25 rows
time_id (PK)
year
decade
era
dim_stage
11 rows
stage_id (PK)
stage_name
stage_order
fact_startup
40477 rows
market_id (FK)
funding_total_usd
country_id (FK)
funding_rounds
time_id (FK)
seed, venture, angel
status_encoded
round_A through round_H
agg_market_year
Market × Year
agg_country_stage
Country × Stage
agg_market_country
Market × Country
ETL Pipeline
Extract, Transform, Load; data quality at each step
Raw Load
54,294 → 54,294Raw CSV loaded
Null Drop
54,294 → 40,477Dropped rows missing status/country/market
13,817 rows dropped (25.4%)
Feature Engineering
40,477 → 40,477Added 12 derived features
Star Schema
40,477 → 40,4774 dims, 1 fact, 3 cubes
status
11.4%
null rate
country_code
18.7%
null rate
market
16.3%
null rate
funding_total_usd
24.7%
null rate
founded_at
29%
null rate