DBSCAN Clustering

Density-based spatial clustering: finds clusters of arbitrary shape and identifies noise

Unlike K-Means, which forces every point into a cluster, DBSCAN identifies dense regions as clusters and labels sparse points as noise. This is useful for funding data because startups don't form neat spherical groups: there are dense cores of typical companies and scattered outliers.

Parameters: eps=0.8 (neighborhood radius) and min_samples=10 (minimum number of points within that radius for a point to count as a core point of a cluster).
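As a minimal sketch of how these two parameters behave, the snippet below runs scikit-learn's DBSCAN with the same eps and min_samples on synthetic data. The data, the blob positions, and the StandardScaler step are assumptions made for illustration; they are not the funding dataset or the pipeline behind the numbers on this page.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, NOT the real company features):
# two dense blobs plus three isolated points far from everything else.
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(200, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(200, 2))
stragglers = np.array([[20.0, -20.0], [-20.0, 20.0], [30.0, 30.0]])
X = np.vstack([blob_a, blob_b, stragglers])

# eps is a distance, so it only makes sense relative to the feature scale.
# Standardizing first is an assumption of this sketch.
X_scaled = StandardScaler().fit_transform(X)

# Same parameter values as quoted above.
labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X_scaled)

# DBSCAN marks noise with the label -1; every other label is a cluster id.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

The number of clusters is an output here, not an input: it falls out of how many dense regions exist at the chosen eps and min_samples.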

Clusters Found: 11

Noise Points: 279 (0.8% of data)

Total Companies: 34,492

PCA Variance: 83.3% (2 components)
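A figure like the PCA variance above is typically the share of total variance retained by the first two principal components. The sketch below shows one way to compute such a share with scikit-learn; the four synthetic features and the standardization step are assumptions for illustration and will not reproduce the 83.3% value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features (an assumption): four columns built from two
# underlying signals, so two components capture almost all the variance.
base = rng.normal(size=(500, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=500),
])

# Project the standardized features onto the first two components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(StandardScaler().fit_transform(X))

# Fraction of total variance retained by the 2-D projection.
explained = float(pca.explained_variance_ratio_.sum())
print(f"variance explained by 2 components: {explained:.1%}")
```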

Noise (279)

Cluster 0 (18,964)

Cluster 1 (7,756)

Cluster 2 (1,852)

Cluster 3 (531)

Cluster 4 (3,660)

Cluster 5 (139)

Cluster 6 (948)

Cluster 7 (31)

Cluster 8 (68)

Cluster 9 (239)

Cluster 10 (25)

Outlier Detection

Isolation Forest: anomalous companies that don't fit normal patterns

Isolation Forest detects outliers by recursively partitioning the data with random splits. Anomalies are isolated in fewer splits than normal points. Contamination is set to 5%, the expected proportion of outliers.
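The contamination setting can be sketched with scikit-learn's IsolationForest as follows. The synthetic data and the random_state value are assumptions for illustration, not the configuration used to produce the results on this page.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption): 1,000 points, two features.
X = rng.normal(size=(1000, 2))

# contamination=0.05 places the decision threshold so that roughly 5% of
# the training points fall on the outlier side.
forest = IsolationForest(contamination=0.05, random_state=0)
flags = forest.fit_predict(X)          # -1 = outlier, 1 = normal
scores = forest.decision_function(X)   # negative values = outlier side

n_outliers = int(np.sum(flags == -1))
print(f"flagged {n_outliers} of {len(X)} points as outliers")
```

Because contamination fixes the share of flagged points in advance, the outlier count mostly reflects that setting and the dataset size rather than any property the model discovered.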

Outliers Found: 2,024

Outlier Avg Funding: $162.3M

Normal Avg Funding: $6.8M

Ratio: 23.7× (outlier vs normal)

Outliers by Status

operating 1,760

acquired 221

closed 43

Top Markets with Outliers

Biotechnology 277

Software 161

Clean Technology 119

Health Care 118

Mobile 75

Enterprise Software 70

E-Commerce 65

Advertising 57

Market	Country	Funding	Rounds	Status	Anomaly Score
Analytics	USA	$950.0M	12	operating	-0.208
Automotive	USA	$823.0M	11	operating	-0.204
File Sharing	USA	$564.1M	12	operating	-0.204
E-Commerce	USA	$516.9M	11	operating	-0.203
Communities	USA	$2.4B	11	operating	-0.202
Online Shopping	IND	$2.4B	11	operating	-0.202
Construction	USA	$1.0B	13	operating	-0.198
Automotive	USA	$1.5B	9	acquired	-0.198
E-Commerce	USA	$934.7M	10	operating	-0.196
Enterprise Software	NLD	$1.4B	9	operating	-0.194
Consumer Electronics	USA	$518.8M	11	operating	-0.192
Peer-to-Peer	USA	$566.2M	10	operating	-0.191
Manufacturing	USA	$1.6B	8	closed	-0.190
Information Technology	USA	$384.4M	9	acquired	-0.188
Solar	USA	$845.0M	9	operating	-0.188
Clean Technology	USA	$307.6M	11	operating	-0.187
Health Care	USA	$291.5M	9	operating	-0.187
Technology	USA	$866.6M	9	operating	-0.186
Software	USA	$290.0M	10	operating	-0.186
Analytics	USA	$1.2B	8	operating	-0.186

DBSCAN vs K-Means

K-Means (used on the Clusters page) assigns every point to exactly one of K clusters. It assumes clusters are spherical and roughly equal-sized. Good for market segmentation where you want clean groups.

DBSCAN doesn't require a pre-set number of clusters. It finds clusters of any shape and explicitly labels noise. Good for anomaly detection: the noise points are the interesting ones.

Isolation Forest is a dedicated anomaly detection method. It builds random trees and measures how quickly each point gets isolated: the faster a point is isolated, the more anomalous it is. Unlike DBSCAN's binary noise label, it provides a continuous anomaly score.
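The difference in what each method returns can be seen by running all three on the same small dataset. This is a sketch using scikit-learn on synthetic data; the data, n_clusters=2, and the random seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# Synthetic data (an assumption): one dense blob plus three far-away points.
dense = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(300, 2))
far = np.array([[6.0, 6.0], [-6.0, 5.0], [7.0, -6.0]])
X = np.vstack([dense, far])

# K-Means: every point gets one of K cluster ids; there is no noise label.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: dense points get a cluster id, isolated points get -1 (noise).
dbscan_labels = DBSCAN(eps=0.8, min_samples=10).fit_predict(X)

# Isolation Forest: every point gets a continuous score (lower = more anomalous).
iso_scores = IsolationForest(random_state=0).fit(X).decision_function(X)

print("K-Means labels for the far points:", kmeans_labels[-3:])
print("DBSCAN labels for the far points: ", dbscan_labels[-3:])
print("Isolation Forest scores for them: ", np.round(iso_scores[-3:], 3))
```

K-Means has to place the three far points in some cluster, DBSCAN labels them as noise, and Isolation Forest ranks them by how anomalous they are.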