DBSCAN Clustering
Density-based spatial clustering: finds clusters of arbitrary shape and identifies noise
Unlike K-Means, which forces every point into a cluster, DBSCAN identifies dense regions as clusters and labels sparse points as noise. This is useful for funding data because startups don't form neat spherical groups: there are dense cores of typical companies and scattered outliers.
Parameters: eps=0.8 (neighborhood radius) and min_samples=10 (minimum number of points within that radius for a point to count as a core point of a cluster).
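As a rough sketch of what a run with these parameters could look like, the snippet below applies scikit-learn's DBSCAN with eps=0.8 and min_samples=10 to synthetic data. The library choice, the synthetic points, and the variable names are assumptions for illustration; the page does not state which implementation or feature scaling was used. Points labeled -1 are noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in for the real feature matrix (assumption):
# two dense groups plus a handful of isolated points.
dense_a = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(300, 2))
dense_b = rng.normal(loc=[5.0, 5.0], scale=0.2, size=(300, 2))
isolated = np.array([[20.0, -20.0], [-20.0, 20.0], [30.0, 30.0],
                     [-30.0, -30.0], [40.0, 0.0]])
X = np.vstack([dense_a, dense_b, isolated])

# Same parameter values as quoted on the page.
labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)

n_clusters = len(set(labels.tolist()) - {-1})  # -1 is the noise label
n_noise = int((labels == -1).sum())
noise_share = n_noise / len(labels)

print(f"clusters={n_clusters}, noise={n_noise} ({noise_share:.1%} of data)")
```

Because eps is a distance in feature space, the same value behaves very differently depending on how the features are scaled, so the numbers from this toy data say nothing about the real dataset.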
Clusters Found
11
Noise Points
279
0.8% of data
Total Companies
34,492
PCA Variance
83.3%
2 components
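A figure like "PCA Variance, 2 components" is typically the sum of the explained-variance ratios of the first two principal components. The sketch below shows that calculation with scikit-learn on a hypothetical feature matrix. The data, the standardization step, and the names are assumptions, and the page does not say whether the projection fed DBSCAN or was used only for plotting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical 4-feature matrix driven by 2 latent factors plus small noise.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 4))

# Standardize so no single feature dominates the components (assumed step).
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2).fit(X_scaled)
projected = pca.transform(X_scaled)             # shape (500, 2)
retained = pca.explained_variance_ratio_.sum()  # share of variance kept

print(f"variance retained by 2 components: {retained:.1%}")
```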
Outlier Detection
Isolation Forest: anomalous companies that don't fit normal patterns
Isolation Forest detects outliers by recursively partitioning the data with random splits. Anomalies tend to be isolated in fewer splits than normal points. Contamination is set to 5%, the expected proportion of outliers.
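To make the role of the contamination setting concrete, here is a minimal sketch that assumes scikit-learn's IsolationForest and synthetic data; neither the library nor the data comes from the page. With contamination=0.05, the decision threshold is placed so that about 5% of the training rows are flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in (assumption): 2,000 ordinary rows, 20 of them shifted far away.
X = rng.normal(size=(2000, 3))
X[:20] += 8.0

forest = IsolationForest(contamination=0.05, random_state=0).fit(X)

pred = forest.predict(X)              # +1 = normal, -1 = outlier
outlier_share = (pred == -1).mean()   # close to the contamination value

print(f"flagged {int((pred == -1).sum())} of {len(X)} rows ({outlier_share:.1%})")
```

Contamination only sets where the cutoff falls on the score distribution. It does not change how the trees are built, so raising it flags more rows without changing their relative ordering.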
Outliers Found
2,024
Outlier Avg Funding
$162.3M
Normal Avg Funding
$6.8M
Ratio
23.7×
outlier vs normal
[Charts: Outliers by Status, Top Markets with Outliers]
| Market | Country | Funding | Rounds | Status | Anomaly Score |
|---|---|---|---|---|---|
| Analytics | USA | $950.0M | 12 | operating | -0.208 |
| Automotive | USA | $823.0M | 11 | operating | -0.204 |
| File Sharing | USA | $564.1M | 12 | operating | -0.204 |
| E-Commerce | USA | $516.9M | 11 | operating | -0.203 |
| Communities | USA | $2.4B | 11 | operating | -0.202 |
| Online Shopping | IND | $2.4B | 11 | operating | -0.202 |
| Construction | USA | $1.0B | 13 | operating | -0.198 |
| Automotive | USA | $1.5B | 9 | acquired | -0.198 |
| E-Commerce | USA | $934.7M | 10 | operating | -0.196 |
| Enterprise Software | NLD | $1.4B | 9 | operating | -0.194 |
| Consumer Electronics | USA | $518.8M | 11 | operating | -0.192 |
| Peer-to-Peer | USA | $566.2M | 10 | operating | -0.191 |
| Manufacturing | USA | $1.6B | 8 | closed | -0.190 |
| Information Technology | USA | $384.4M | 9 | acquired | -0.188 |
| Solar | USA | $845.0M | 9 | operating | -0.188 |
| Clean Technology | USA | $307.6M | 11 | operating | -0.187 |
| Health Care | USA | $291.5M | 9 | operating | -0.187 |
| Technology | USA | $866.6M | 9 | operating | -0.186 |
| Software | USA | $290.0M | 10 | operating | -0.186 |
| Analytics | USA | $1.2B | 8 | operating | -0.186 |
DBSCAN vs K-Means
K-Means (used on the Clusters page) assigns every point to exactly one of K clusters. It assumes clusters are spherical and roughly equal-sized. Good for market segmentation where you want clean groups.
DBSCAN doesn't require a pre-set number of clusters. It finds clusters of any shape and explicitly labels noise. Good for anomaly detection: the noise points are the interesting ones.
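The contrast between the two methods can be sketched on a toy dataset with non-spherical clusters. The example assumes scikit-learn and its make_moons generator. The eps value here is chosen for this toy data and is not the page's setting, and the three far-away points are added by hand to stand in for noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved crescents: clusters that are not spherical.
X_moons, true_moon = make_moons(n_samples=400, noise=0.03, random_state=0)
far_points = np.array([[4.0, 4.0], [-4.0, -4.0], [4.0, -4.0]])
X = np.vstack([X_moons, far_points])

# K-Means needs K up front and assigns every row to a cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN infers the number of clusters and marks sparse rows as -1.
dbscan_labels = DBSCAN(eps=0.25, min_samples=10).fit_predict(X)

# Agreement with the true crescents, measured on the 400 crescent points only.
kmeans_ari = adjusted_rand_score(true_moon, kmeans_labels[:400])
dbscan_ari = adjusted_rand_score(true_moon, dbscan_labels[:400])

print("K-Means labels:", sorted(set(kmeans_labels.tolist())), f"ARI={kmeans_ari:.2f}")
print("DBSCAN labels: ", sorted(set(dbscan_labels.tolist())), f"ARI={dbscan_ari:.2f}")
```

On data like this, K-Means tends to cut straight across the crescents and absorbs the far-away points into the nearest cluster, while DBSCAN follows each crescent and leaves the far-away points unassigned.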
Isolation Forest is a dedicated anomaly detection method. It builds random trees and measures how quickly each point gets isolated: the fewer splits needed, the more anomalous the point. Unlike DBSCAN's binary noise label, it provides a continuous anomaly score, so outliers can be ranked.
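The difference between a noise label and a continuous score can be sketched as follows. The example assumes scikit-learn, synthetic data, and the decision_function output as the score, where more negative means more anomalous. Whether the page's Anomaly Score column uses that exact quantity is not stated.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic stand-in (assumption): an ordinary cloud plus three extreme rows.
X = rng.normal(size=(1000, 2))
X[:3] = [[6.0, 6.0], [9.0, 9.0], [15.0, 15.0]]

# DBSCAN: each row is either in a cluster or noise (-1), with no ordering among noise rows.
dbscan_labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)

# Isolation Forest: every row gets a score, so rows can be sorted.
forest = IsolationForest(random_state=0).fit(X)
scores = forest.decision_function(X)
ranking = np.argsort(scores)  # most negative (most anomalous) first
top5 = ranking[:5]

for idx in top5:
    print(f"row {idx}: score={scores[idx]:.3f}, dbscan_label={dbscan_labels[idx]}")
```

Sorting by score is what makes a "top outliers" table possible; a noise label alone can only say which rows are outside every cluster.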