NeurondB Documentation

ML Analytics Suite

Clustering Algorithms

K-Means Clustering

Lloyd's K-Means algorithm with k-means++ initialization, used to find customer segments, topic clusters, and other groupings in vector data.

K-Means clustering

-- K-Means clustering
SELECT cluster_kmeans(
  'train_data',   -- table with vectors
  'features',     -- vector column
  7,              -- K clusters
  50              -- max iterations
);

-- Project-based training and versioning
SELECT neurondb_train_kmeans_project(
  'fraud_kmeans',   -- project name
  'train_data',     -- table with vectors
  'features',       -- vector column
  7,                -- K clusters
  50                -- max iterations
) AS model_id;

-- List models for a project
SELECT version, algorithm, parameters, is_deployed
FROM neurondb_list_project_models('fraud_kmeans')
ORDER BY version;
  • Time Complexity: O(n·k·i·d) for n vectors, k clusters, i iterations, and d dimensions
  • Initialization: k-means++
  • Project Models: Versioned training runs
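
To make the K and max-iteration parameters concrete, here is a minimal pure-Python sketch of Lloyd's iteration with k-means++ seeding. It illustrates the general technique only and is not NeurondB's implementation. The names `sq_dist`, `kmeans_pp_init`, and `lloyd_kmeans` are hypothetical, and the sample points are made up for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: pick each new centroid with probability
    proportional to its squared distance from the nearest chosen one."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # every point coincides with a centroid
            centroids.append(list(rng.choice(points)))
        else:
            centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


def lloyd_kmeans(points, k, max_iter=50, seed=0):
    """Return (centroids, labels) after at most max_iter Lloyd iterations."""
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: O(n*k*d) work per iteration.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # converged before max_iter
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if the cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Two well-separated groups of 2-D points (illustrative data).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = lloyd_kmeans(points, k=2)
```

The loop stops early once assignments stop changing, so the max-iterations argument acts as an upper bound rather than a fixed cost.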

DBSCAN

Density-based clustering that finds clusters of arbitrary shape. It groups dense regions and labels points in sparse regions as noise.

DBSCAN clustering

SELECT *
FROM cluster_dbscan(
  relation      => 'train_data',
  column_name   => 'features',
  eps           => 0.35,
  min_samples   => 12,
  distance      => 'cosine'
);
  • No need to specify the cluster count: DBSCAN derives the groupings from point density
  • Handles noise and outliers automatically
  • Works well with non-spherical clusters
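
The roles of `eps` and `min_samples` are easier to see in code. Below is a minimal brute-force sketch of the textbook DBSCAN procedure with cosine distance. It is an illustration under stated assumptions, not NeurondB's implementation: the function names are hypothetical, the sample vectors are invented, and `min_samples` is lowered to 3 so the tiny data set can form clusters.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def dbscan(points, eps, min_samples, dist=cosine_distance):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Brute-force region query; the result includes point i itself.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:  # j is a core point, keep expanding
                queue.extend(reach)
    return labels


# Two tight groups of directions plus one vector pointing the opposite way.
vectors = [
    (1.0, 0.0), (0.99, 0.1), (0.98, 0.15),
    (0.0, 1.0), (0.1, 0.99), (0.15, 0.98),
    (-1.0, -1.0),
]
labels = dbscan(vectors, eps=0.35, min_samples=3)
```

With this data the two groups of directions become clusters 0 and 1, and the opposite-pointing vector is labelled -1 (noise), without the cluster count ever being specified.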

Outlier Detection

Z-Score Outlier Detection

Z-score outliers

SELECT *
FROM detect_outliers_zscore(
  (SELECT embedding FROM documents),
  2.5  -- threshold
);
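
As a rough illustration of what a z-score threshold means for vectors, the sketch below standardizes each dimension and flags a vector when any single component lies more than `threshold` standard deviations from that dimension's mean. How NeurondB aggregates across dimensions is not stated here, so the "any component" rule is an assumption, and `zscore_outliers` is a hypothetical name.

```python
import statistics


def zscore_outliers(vectors, threshold=2.5):
    """Return indices of vectors with any component whose |z| > threshold.

    Assumption: dimensions are scored independently using the population
    standard deviation; constant dimensions are skipped.
    """
    columns = list(zip(*vectors))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    flagged = []
    for i, vec in enumerate(vectors):
        for x, mu, sigma in zip(vec, means, stdevs):
            if sigma > 0 and abs(x - mu) / sigma > threshold:
                flagged.append(i)
                break  # one extreme component is enough to flag the vector
    return flagged


# Nine ordinary vectors and one with an extreme first component.
base = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9]
vectors = [(x, 2.0) for x in base] + [(10.0, 2.0)]
outliers = zscore_outliers(vectors, threshold=2.5)
```

In this example only the last vector (index 9) is flagged. Its first component has a z-score of about 3.0, while every other vector stays below 0.5.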

Isolation Forest

Isolation forest

SELECT *
FROM detect_outliers_isolation_forest(
  (SELECT embedding FROM documents),
  100  -- n_estimators
);
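
The idea behind the `n_estimators` argument can be sketched as follows: each estimator is a random partitioning tree, and points that are isolated after only a few splits receive a high anomaly score. This is a minimal pure-Python sketch of the standard isolation forest scoring scheme, not NeurondB's implementation. All function names, the `sample_size` parameter, and the sample data are hypothetical.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # nothing to split on in this dimension
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(point, tree, depth=0):
    """Depth at which the point lands, adjusted for unsplit leaf size."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(point, left if point[dim] < split else right, depth + 1)


def isolation_scores(points, n_estimators=100, sample_size=256, seed=0):
    """Anomaly score in (0, 1]; values near 1 indicate likely outliers."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(points, size), 0, max_depth, rng)
             for _ in range(n_estimators)]
    norm = c_factor(size)
    scores = []
    for p in points:
        mean_depth = sum(path_length(p, t) for t in trees) / n_estimators
        scores.append(2.0 ** (-mean_depth / norm))
    return scores


# A small grid of inliers plus one far-away point.
points = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(4)] + [(10.0, 10.0)]
scores = isolation_scores(points, n_estimators=100)
```

More estimators average out the randomness of individual trees, which makes the scores more stable at the cost of more work per point.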

Dimensionality Reduction

PCA (Principal Component Analysis)

PCA

SELECT *
FROM reduce_pca(
  (SELECT embedding FROM documents),
  50  -- target dimensions
);
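
To show what "target dimensions" means, here is a minimal pure-Python sketch of PCA: center the data, find the directions of greatest variance, and project onto the first few of them. It uses power iteration with deflation for simplicity, which is an illustrative choice and not necessarily what NeurondB does. The function name `pca` and its parameters are hypothetical.

```python
import math
import random


def pca(vectors, n_components, n_iter=200, seed=0):
    """Project vectors onto their top principal components.

    Uses power iteration with deflation on the covariance matrix, which is
    simple but slow; production implementations typically use SVD.
    """
    rng = random.Random(seed)
    n, d = len(vectors), len(vectors[0])
    means = [sum(col) / n for col in zip(*vectors)]
    centered = [[x - m for x, m in zip(v, means)] for v in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        vec = [rng.gauss(0, 1) for _ in range(d)]
        start_norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / start_norm for x in vec]
        for _ in range(n_iter):
            nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm == 0:  # no variance left to explain
                break
            vec = [x / norm for x in nxt]
        # Rayleigh quotient gives the variance captured along vec.
        eigval = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(d))
                     for i in range(d))
        components.append(vec)
        # Deflate: remove the variance captured by this component.
        cov = [[cov[i][j] - eigval * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    projected = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
                 for row in centered]
    return projected, components


# 3-D points that all lie on the line y = 2x, z = 0 (illustrative data).
vectors = [(float(t), 2.0 * t, 0.0) for t in range(-3, 4)]
projected, components = pca(vectors, n_components=1)
```

Because the sample points lie on a single line, one component captures all of the variance. The recovered direction is parallel to (1, 2, 0), up to sign.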

Next Steps