Clustering Algorithms
K-means Clustering
Group similar data points using unsupervised learning. NeuronDB supports K-means with k-means++ initialization. K-means partitions data into k clusters by minimizing within-cluster variance.
Method 1: Direct Clustering (Returns Table with Cluster Assignments)
Use neurondb.cluster_kmeans to directly get cluster assignments for all rows. This returns a table with id and cluster_id columns.
K-means clustering - direct method
-- Cluster data and get assignments
-- Returns: table with columns (id, cluster_id)
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_kmeans(
'test_data', -- table name (text)
'features', -- vector column name (text)
3, -- number of clusters k (integer)
100 -- max iterations (integer)
)
GROUP BY cluster_id
ORDER BY cluster_id;
-- Get cluster assignments with original data
WITH clusters AS (
SELECT * FROM neurondb.cluster_kmeans('test_data', 'features', 3, 100)
)
SELECT
t.id,
t.features,
c.cluster_id
FROM test_data t
JOIN clusters c ON t.id = c.id
ORDER BY c.cluster_id, t.id;Function Signature:
neurondb.cluster_kmeans( table_name TEXT, -- Source table name column_name TEXT, -- Vector column name k INTEGER, -- Number of clusters max_iter INTEGER -- Maximum iterations ) RETURNS TABLE ( id INTEGER, -- Row ID from source table cluster_id INTEGER -- Assigned cluster (0 to k-1) )Method 2: Train Model and Evaluate
Train a K-means model and save it for later use, then evaluate clustering quality.
Train K-means model
-- Train K-means model and get model_id
CREATE TEMP TABLE kmeans_model AS
SELECT train_kmeans_model_id(
'test_data', -- table name
'features', -- vector column
3, -- number of clusters k
100 -- max iterations
) AS model_id;
-- View model_id
SELECT model_id FROM kmeans_model;
-- Evaluate model quality
CREATE TEMP TABLE kmeans_metrics AS
SELECT evaluate_kmeans_by_model_id(
(SELECT model_id FROM kmeans_model),
'test_data', -- test table
'features' -- vector column
) AS metrics;
-- View evaluation metrics
SELECT
'Inertia' AS metric,
ROUND((metrics->>'inertia')::numeric, 6)::text AS value
FROM kmeans_metrics
UNION ALL
SELECT 'N_Clusters', (metrics->>'n_clusters')::text
FROM kmeans_metrics
UNION ALL
SELECT 'N_Iterations', (metrics->>'n_iterations')::text
FROM kmeans_metrics
ORDER BY metric;Function Signatures:
train_kmeans_model_id( table_name TEXT, column_name TEXT, k INTEGER, max_iter INTEGER ) RETURNS INTEGER -- Returns model_id evaluate_kmeans_by_model_id( model_id INTEGER, table_name TEXT, column_name TEXT ) RETURNS JSONB -- Returns metrics: inertia, n_clusters, n_iterationsDBSCAN
Density-based clustering for arbitrary shapes. Automatically detects noise (outliers) while grouping dense regions. Points not assigned to any cluster get cluster_id = -1 (noise).
DBSCAN clustering
-- DBSCAN clustering
-- Returns: table with columns (id, cluster_id)
-- cluster_id = -1 indicates noise/outlier points
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_dbscan(
'test_data', -- table name
'features', -- vector column
1.0, -- eps: maximum distance for neighbors
2 -- min_pts: minimum points to form cluster
)
GROUP BY cluster_id
ORDER BY cluster_id;
-- Example: Tight eps creates more noise points
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_dbscan('test_data', 'features', 0.3, 2)
GROUP BY cluster_id
ORDER BY cluster_id;
-- Example: Larger eps groups more points
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_dbscan('test_data', 'features', 3.0, 2)
GROUP BY cluster_id
ORDER BY cluster_id;Function Signature:
neurondb.cluster_dbscan( table_name TEXT, -- Source table name column_name TEXT, -- Vector column name eps REAL, -- Maximum distance for neighbors min_pts INTEGER -- Minimum points to form cluster ) RETURNS TABLE ( id INTEGER, -- Row ID from source table cluster_id INTEGER -- Cluster ID (-1 for noise/outliers) )Parameters:
eps: Maximum distance between two points to be considered neighbors. Smaller values = tighter clusters, more noise points.min_pts: Minimum number of points required to form a dense region (cluster). Higher values = fewer clusters, more noise.
Gaussian Mixture Models
Probabilistic clustering that models data as a mixture of Gaussian distributions. Each cluster is represented by a Gaussian distribution with mean and covariance.
GMM clustering
-- Gaussian Mixture Model clustering
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_gmm(
'test_data', -- table name
'features', -- vector column
3, -- n_components: number of Gaussian components
50 -- max_iter: maximum EM iterations
)
GROUP BY cluster_id
ORDER BY cluster_id;Function Signature:
neurondb.cluster_gmm( table_name TEXT, -- Source table name column_name TEXT, -- Vector column name n_components INTEGER, -- Number of Gaussian components max_iter INTEGER -- Maximum EM algorithm iterations ) RETURNS TABLE ( id INTEGER, -- Row ID from source table cluster_id INTEGER -- Assigned cluster (0 to n_components-1) )Mini-batch K-means
Faster variant of K-means that uses random mini-batches instead of the full dataset. Suitable for very large datasets.
Mini-batch K-means
-- Mini-batch K-means clustering
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_minibatch_kmeans(
'test_data', -- table name
'features', -- vector column
3, -- number of clusters k
3, -- batch_size: samples per batch
50 -- max_iter: maximum iterations
)
GROUP BY cluster_id
ORDER BY cluster_id;Function Signature:
neurondb.cluster_minibatch_kmeans( table_name TEXT, column_name TEXT, k INTEGER, batch_size INTEGER, -- Number of samples per mini-batch max_iter INTEGER ) RETURNS TABLE (id INTEGER, cluster_id INTEGER)Hierarchical Clustering
Builds a hierarchy of clusters using different linkage criteria (single, complete, average).
Hierarchical clustering
-- Hierarchical clustering with single linkage
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_hierarchical(
'test_data', -- table name
'features', -- vector column
3, -- number of clusters
'single' -- linkage: 'single', 'complete', or 'average'
)
GROUP BY cluster_id
ORDER BY cluster_id;
-- Complete linkage
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_hierarchical('test_data', 'features', 2, 'complete')
GROUP BY cluster_id
ORDER BY cluster_id;
-- Average linkage
SELECT
cluster_id,
COUNT(*) as cluster_size
FROM neurondb.cluster_hierarchical('test_data', 'features', 3, 'average')
GROUP BY cluster_id
ORDER BY cluster_id;Function Signature:
neurondb.cluster_hierarchical( table_name TEXT, column_name TEXT, n_clusters INTEGER, linkage TEXT -- 'single', 'complete', or 'average' ) RETURNS TABLE (id INTEGER, cluster_id INTEGER)Clustering Quality Metrics
Evaluate clustering quality using the Davies-Bouldin Index. Lower values indicate better clustering.
Evaluate clustering quality
-- Calculate Davies-Bouldin Index for K-means results
WITH kmeans_results AS (
SELECT * FROM neurondb.cluster_kmeans('test_data', 'features', 3, 100)
)
SELECT
neurondb.davies_bouldin_index(
'test_data', -- data table
'features', -- vector column
'kmeans_results', -- clustering results table
'cluster_id' -- cluster ID column
) AS db_index; -- Lower is betterFunction Signature:
neurondb.davies_bouldin_index( data_table TEXT, -- Original data table vector_column TEXT, -- Vector column name cluster_table TEXT, -- Clustering results table cluster_column TEXT -- Cluster ID column name ) RETURNS REAL -- Davies-Bouldin Index (lower = better)Next Steps
- Classification - Supervised learning algorithms
- Analytics Suite - Complete ML analytics
- Performance - Optimize clustering performance