DocumentationNeurondB Documentation

Clustering Algorithms

K-means Clustering

Group similar data points using unsupervised learning. NeuronDB supports K-means with k-means++ initialization. K-means partitions data into k clusters by minimizing within-cluster variance.

Method 1: Direct Clustering (Returns Table with Cluster Assignments)

Use neurondb.cluster_kmeans to directly get cluster assignments for all rows. This returns a table with id and cluster_id columns.

K-means clustering - direct method

-- Cluster data and get assignments
-- Returns: table with columns (id, cluster_id)
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_kmeans(
    'test_data',    -- table name (text)
    'features',     -- vector column name (text)
    3,              -- number of clusters k (integer)
    100             -- max iterations (integer)
)
GROUP BY cluster_id
ORDER BY cluster_id;

-- Get cluster assignments with original data
WITH clusters AS (
    SELECT * FROM neurondb.cluster_kmeans('test_data', 'features', 3, 100)
)
SELECT 
    t.id,
    t.features,
    c.cluster_id
FROM test_data t
JOIN clusters c ON t.id = c.id
ORDER BY c.cluster_id, t.id;

Function Signature:

neurondb.cluster_kmeans( table_name TEXT,    -- Source table name column_name TEXT,   -- Vector column name k INTEGER,          -- Number of clusters max_iter INTEGER    -- Maximum iterations ) RETURNS TABLE ( id INTEGER,         -- Row ID from source table cluster_id INTEGER  -- Assigned cluster (0 to k-1) )

Method 2: Train Model and Evaluate

Train a K-means model and save it for later use, then evaluate clustering quality.

Train K-means model

-- Train K-means model and get model_id
CREATE TEMP TABLE kmeans_model AS
SELECT train_kmeans_model_id(
    'test_data',    -- table name
    'features',     -- vector column
    3,              -- number of clusters k
    100             -- max iterations
) AS model_id;

-- View model_id
SELECT model_id FROM kmeans_model;

-- Evaluate model quality
CREATE TEMP TABLE kmeans_metrics AS
SELECT evaluate_kmeans_by_model_id(
    (SELECT model_id FROM kmeans_model),
    'test_data',    -- test table
    'features'      -- vector column
) AS metrics;

-- View evaluation metrics
SELECT
    'Inertia' AS metric, 
    ROUND((metrics->>'inertia')::numeric, 6)::text AS value
FROM kmeans_metrics
UNION ALL
SELECT 'N_Clusters', (metrics->>'n_clusters')::text
FROM kmeans_metrics
UNION ALL
SELECT 'N_Iterations', (metrics->>'n_iterations')::text
FROM kmeans_metrics
ORDER BY metric;

Function Signatures:

train_kmeans_model_id( table_name TEXT, column_name TEXT, k INTEGER, max_iter INTEGER ) RETURNS INTEGER  -- Returns model_id evaluate_kmeans_by_model_id( model_id INTEGER, table_name TEXT, column_name TEXT ) RETURNS JSONB     -- Returns metrics: inertia, n_clusters, n_iterations

DBSCAN

Density-based clustering for arbitrary shapes. Automatically detects noise (outliers) while grouping dense regions. Points not assigned to any cluster get cluster_id = -1 (noise).

DBSCAN clustering

-- DBSCAN clustering
-- Returns: table with columns (id, cluster_id)
-- cluster_id = -1 indicates noise/outlier points
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_dbscan(
    'test_data',    -- table name
    'features',     -- vector column
    1.0,            -- eps: maximum distance for neighbors
    2               -- min_pts: minimum points to form cluster
)
GROUP BY cluster_id
ORDER BY cluster_id;

-- Example: Tight eps creates more noise points
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_dbscan('test_data', 'features', 0.3, 2)
GROUP BY cluster_id
ORDER BY cluster_id;

-- Example: Larger eps groups more points
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_dbscan('test_data', 'features', 3.0, 2)
GROUP BY cluster_id
ORDER BY cluster_id;

Function Signature:

neurondb.cluster_dbscan( table_name TEXT,    -- Source table name column_name TEXT,   -- Vector column name eps REAL,           -- Maximum distance for neighbors min_pts INTEGER      -- Minimum points to form cluster ) RETURNS TABLE ( id INTEGER,         -- Row ID from source table cluster_id INTEGER  -- Cluster ID (-1 for noise/outliers) )

Parameters:

eps: Maximum distance between two points to be considered neighbors. Smaller values = tighter clusters, more noise points.
min_pts: Minimum number of points required to form a dense region (cluster). Higher values = fewer clusters, more noise.

Gaussian Mixture Models

Probabilistic clustering that models data as a mixture of Gaussian distributions. Each cluster is represented by a Gaussian distribution with mean and covariance.

GMM clustering

-- Gaussian Mixture Model clustering
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_gmm(
    'test_data',    -- table name
    'features',     -- vector column
    3,              -- n_components: number of Gaussian components
    50              -- max_iter: maximum EM iterations
)
GROUP BY cluster_id
ORDER BY cluster_id;

Function Signature:

neurondb.cluster_gmm( table_name TEXT,    -- Source table name column_name TEXT,   -- Vector column name n_components INTEGER, -- Number of Gaussian components max_iter INTEGER    -- Maximum EM algorithm iterations ) RETURNS TABLE ( id INTEGER,         -- Row ID from source table cluster_id INTEGER  -- Assigned cluster (0 to n_components-1) )

Mini-batch K-means

Faster variant of K-means that uses random mini-batches instead of the full dataset. Suitable for very large datasets.

Mini-batch K-means

-- Mini-batch K-means clustering
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_minibatch_kmeans(
    'test_data',    -- table name
    'features',     -- vector column
    3,              -- number of clusters k
    3,              -- batch_size: samples per batch
    50              -- max_iter: maximum iterations
)
GROUP BY cluster_id
ORDER BY cluster_id;

Function Signature:

neurondb.cluster_minibatch_kmeans( table_name TEXT, column_name TEXT, k INTEGER, batch_size INTEGER,  -- Number of samples per mini-batch max_iter INTEGER ) RETURNS TABLE (id INTEGER, cluster_id INTEGER)

Hierarchical Clustering

Builds a hierarchy of clusters using different linkage criteria (single, complete, average).

Hierarchical clustering

-- Hierarchical clustering with single linkage
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_hierarchical(
    'test_data',    -- table name
    'features',     -- vector column
    3,              -- number of clusters
    'single'        -- linkage: 'single', 'complete', or 'average'
)
GROUP BY cluster_id
ORDER BY cluster_id;

-- Complete linkage
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_hierarchical('test_data', 'features', 2, 'complete')
GROUP BY cluster_id
ORDER BY cluster_id;

-- Average linkage
SELECT 
    cluster_id, 
    COUNT(*) as cluster_size
FROM neurondb.cluster_hierarchical('test_data', 'features', 3, 'average')
GROUP BY cluster_id
ORDER BY cluster_id;

Function Signature:

neurondb.cluster_hierarchical( table_name TEXT, column_name TEXT, n_clusters INTEGER, linkage TEXT  -- 'single', 'complete', or 'average' ) RETURNS TABLE (id INTEGER, cluster_id INTEGER)

Clustering Quality Metrics

Evaluate clustering quality using the Davies-Bouldin Index. Lower values indicate better clustering.

Evaluate clustering quality

-- Calculate Davies-Bouldin Index for K-means results
WITH kmeans_results AS (
    SELECT * FROM neurondb.cluster_kmeans('test_data', 'features', 3, 100)
)
SELECT 
    neurondb.davies_bouldin_index(
        'test_data',           -- data table
        'features',            -- vector column
        'kmeans_results',      -- clustering results table
        'cluster_id'           -- cluster ID column
    ) AS db_index;  -- Lower is better

Function Signature:

neurondb.davies_bouldin_index( data_table TEXT,        -- Original data table vector_column TEXT,     -- Vector column name cluster_table TEXT,      -- Clustering results table cluster_column TEXT     -- Cluster ID column name ) RETURNS REAL  -- Davies-Bouldin Index (lower = better)

Next Steps

Classification - Supervised learning algorithms
Analytics Suite - Complete ML analytics
Performance - Optimize clustering performance

PreviousEmbeddings

NextClassification