Star Schema Design

How the flat Crunchbase CSV decomposes into a dimensional warehouse

A star schema organizes data into a central fact table (one row per startup, containing all numeric measures) surrounded by dimension tables that describe the "who, what, when, where" context. This design enables fast aggregation across any combination of dimensions.

dim_market

736 rows

market_id (PK)

market_name

dim_country

113 rows

country_id (PK)

country_code

dim_time

25 rows

time_id (PK)

year

decade

era

dim_stage

11 rows

stage_id (PK)

stage_name

stage_order

▼

fact_startup

40477 rows

market_id (FK)

funding_total_usd

country_id (FK)

funding_rounds

time_id (FK)

seed, venture, angel

status_encoded

round_A through round_H

▼

agg_market_year

Market × Year

agg_country_stage

Country × Stage

agg_market_country

Market × Country

ETL Pipeline

Extract, Transform, Load; data quality at each step

Raw Load

54,294 → 54,294

Raw CSV loaded

Null Drop

54,294 → 40,477

Dropped rows missing status/country/market

13,817 rows dropped (25.4%)

Feature Engineering

40,477 → 40,477

Added 12 derived features

Star Schema

40,477 → 40,477

4 dims, 1 fact, 3 cubes

status

11.4%

null rate

country_code

18.7%

null rate

market

16.3%

null rate

funding_total_usd

24.7%

null rate

founded_at

29%

null rate