Methodology
Data sources, preprocessing, warehouse design and model details
Dataset
This analysis uses the Crunchbase Startup Investments dataset containing 54,294 companies across 113 countries and 753 markets. The dataset includes company status (operating, acquired, closed), total funding amounts, individual funding round amounts (Seed through Round H), geographic information and founding dates.
Data Preprocessing & ETL
The ETL pipeline cleans funding amounts (removing formatting artifacts and handling missing values encoded as dashes) and drops records with missing status labels (~6,170 rows). It then engineers 12 derived features (funding_age_days, time_to_first_funding, binary round flags, total_funding_types, highest_round), label-encodes categorical variables and standardizes numerical features. Data quality metrics are tracked at each step.
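As a minimal sketch of the cleaning steps above (column names such as funding_total_usd and first_funding_at are assumptions about the raw schema, not confirmed field names):

```python
import pandas as pd

def clean_funding(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the ETL cleaning: parse funding strings, drop unlabeled
    rows, and derive one example feature."""
    df = df.copy()
    # Funding amounts arrive as strings like " 1,500,000 " or "-" for missing.
    df["funding_total_usd"] = (
        df["funding_total_usd"]
        .astype(str)
        .str.strip()
        .str.replace(",", "", regex=False)
        .replace("-", pd.NA)
    )
    df["funding_total_usd"] = pd.to_numeric(df["funding_total_usd"], errors="coerce")
    # Drop records with no status label.
    df = df.dropna(subset=["status"])
    # Example derived feature: days between founding and first funding round.
    df["time_to_first_funding"] = (
        pd.to_datetime(df["first_funding_at"], errors="coerce")
        - pd.to_datetime(df["founded_at"], errors="coerce")
    ).dt.days
    return df
```

The other derived features (round flags, highest_round, etc.) follow the same pattern of vectorized column operations.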
Star Schema Warehouse
The cleaned data is decomposed into a star schema: a central fact_startup table (one row per company, holding all numeric measures) surrounded by four dimension tables (dim_market, dim_country, dim_time, dim_stage). Three pre-aggregated OLAP cube faces are generated: Market×Year, Country×Stage and Market×Country. This design lets the frontend perform slice-and-dice operations without re-aggregating the fact table per query.
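A minimal sketch of how one pre-aggregated cube face (Market×Year) can be built from the fact table with a grouped aggregation (the toy columns and measures here are illustrative, not the actual schema):

```python
import pandas as pd

# Toy fact table: one row per company (dimension keys + numeric measures).
fact = pd.DataFrame({
    "market": ["Biotech", "Biotech", "Software"],
    "founded_year": [2010, 2010, 2012],
    "funding_total_usd": [5_000_000, 3_000_000, 1_000_000],
})

# Pre-aggregated Market×Year cube face: the frontend reads this table
# directly instead of aggregating the fact table on the fly.
market_year = (
    fact.groupby(["market", "founded_year"], as_index=False)
    .agg(companies=("funding_total_usd", "size"),
         total_funding=("funding_total_usd", "sum"),
         avg_funding=("funding_total_usd", "mean"))
)
```

The Country×Stage and Market×Country faces are produced the same way with different grouping keys.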
Classification: Success Prediction
Seven classification models predict startup outcome (acquired vs. closed) on ~5,500 labeled samples: Logistic Regression, Decision Tree, Random Forest, XGBoost, SVM (RBF kernel), K-Nearest Neighbors and Gaussian Naive Bayes. Class imbalance (60:40) is handled with SMOTE oversampling. Models are evaluated via stratified 5-fold cross-validation on accuracy, precision, recall, F1 and ROC-AUC.
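The evaluation protocol can be sketched as follows on synthetic data (the real pipeline applies SMOTE from imblearn inside each fold; here class_weight="balanced" stands in so the sketch needs only scikit-learn, and the feature counts and seeds are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the ~5,500-sample labeled set with 60:40 imbalance.
X, y = make_classification(n_samples=1000, n_features=12,
                           weights=[0.6, 0.4], random_state=0)

# One of the seven models; SMOTE oversampling is replaced here by
# class weighting purely to keep the sketch dependency-free.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)

# Stratified 5-fold cross-validation on the five reported metrics.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv,
                        scoring=["accuracy", "precision", "recall",
                                 "f1", "roc_auc"])
mean_auc = scores["test_roc_auc"].mean()
```

When SMOTE is used, it must be fitted only on each fold's training split (e.g. via an imblearn Pipeline) to avoid leaking synthetic minority samples into the validation folds.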
Clustering: K-Means & DBSCAN
K-Means clustering segments 190+ markets by funding behavior (volume, deal size, success rate, rounds). The optimal k is determined by elbow method and silhouette scores. DBSCAN is applied at the company level to find density-based clusters and noise points in funding/round/age space. Isolation Forest (5% contamination) provides continuous anomaly scores for outlier detection.
Association Rules: Pattern Mining
The Apriori algorithm discovers co-occurrence patterns in binarized transaction data (funding types, markets, geographies, outcomes). Minimum support: 1%, minimum confidence: 50%. Rules are ranked by lift and filtered to those whose consequent is a company status.
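The mining step can be illustrated with a self-contained toy implementation (the transactions and the 50% support threshold here are illustrative; the full analysis uses the 1% support / 50% confidence thresholds stated above, typically via a library implementation):

```python
from itertools import combinations

# Toy binarized transactions: each set holds the items true for one company.
transactions = [
    {"seed", "venture", "Software", "acquired"},
    {"seed", "Software", "closed"},
    {"seed", "venture", "Biotech", "acquired"},
    {"venture", "Software", "acquired"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / n

# Frequent itemsets above the support threshold.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset(c) for size in (1, 2, 3)
            for c in combinations(items, size) if support(set(c)) >= 0.5]

# Rules A -> B, keeping only status-as-consequent rules, ranked by lift.
rules = []
for fs in frequent:
    for r in range(1, len(fs)):
        for a in combinations(fs, r):
            a, b = frozenset(a), fs - frozenset(a)
            conf = support(a | b) / support(a)
            if b <= {"acquired", "closed"} and conf >= 0.5:
                lift = conf / support(b)
                rules.append((set(a), set(b), conf, lift))
rules.sort(key=lambda rule: -rule[3])
```

Lift above 1 means the antecedent and the status co-occur more often than independence would predict, which is what makes a rule interesting here.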
AI Analyst
The chatbot uses Google Gemini 2.0 Flash via client-side API calls. It receives a system prompt containing summary statistics from the dataset and is constrained to reference only the data available on this platform. No conversation data is stored, and the API key is restricted to the deployment domain.
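A sketch of how the grounding prompt and request body might be assembled for the Gemini generateContent endpoint (the summary fields and prompt wording are assumptions about this platform's frontend, not its actual code; field names follow the public REST API):

```python
import json

# Summary statistics injected into the system prompt (values illustrative).
summary = {"companies": 54294, "countries": 113, "markets": 753}

system_prompt = (
    "You are a data analyst for this startup-funding platform. "
    f"Dataset summary: {json.dumps(summary)}. "
    "Answer only from the data available on this platform; "
    "if the answer is not in the summary, say so."
)

# Request body for POST .../models/gemini-2.0-flash:generateContent
payload = {
    "system_instruction": {"parts": [{"text": system_prompt}]},
    "contents": [
        {"role": "user",
         "parts": [{"text": "Which country has the most startups?"}]}
    ],
}
body = json.dumps(payload)
```

Because the constraint lives only in the prompt, it bounds but does not eliminate hallucination, which is why it is listed under Limitations below.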
Limitations
The dataset covers companies founded up to 2014, so more recent trends are not represented. Funding amounts may be incomplete. Status labels are simplified: an acquisition is not always a success, and "operating" does not always mean healthy. A US-centric bias (~53% of companies) limits geographic generalizability. The chatbot can hallucinate despite the system-prompt constraints.