The clustering module groups detected objects by their geometric features using K-Means. The optimal K is selected automatically with the elbow method, based on the within-cluster sum of squares (WCSS), with optional silhouette score validation.
from archeo_cluster.core.clustering import KMeansAnalyzer
from archeo_cluster.models import ClusteringConfig

KMeansAnalyzer

K-Means clustering analyzer for archaeological feature data.
analyzer = KMeansAnalyzer(
    config=ClusteringConfig(max_k=10, random_state=42),
    feature_columns=["area", "perimeter", "circularity", "aspect_ratio", "solidity", "extent"],
)

Constructor

config
ClusteringConfig
default:"ClusteringConfig()"
Clustering configuration. Controls the maximum K to evaluate, random seed, minimum samples per cluster, and whether silhouette scores are computed.
feature_columns
list[str]
Column names from the features CSV to use as input features for clustering. Centroid coordinates are excluded by default because they are spatial, not morphological.

process_features_csv

Process a features CSV file and perform per-image K-Means clustering. This is the primary entry point for the clustering stage.
batch = analyzer.process_features_csv(
    csv_path="./output/features.csv",
    output_dir="./clusters",
    images_dir="./output",   # optional, enables cluster overlay visualizations
)
print(f"{batch.total_clusters} clusters across {batch.image_count} images")
For each image the method:
  1. Runs the elbow method to find optimal K.
  2. Optionally runs silhouette analysis for validation.
  3. Fits K-Means and assigns cluster labels.
  4. Writes <image>_clustered.csv, elbow_method.png, silhouette_analysis.png, cluster_distribution.png, and morphological_scatter.png to output_dir/<image>/.
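The elbow step in the list above can be sketched as follows. The page does not say which elbow heuristic KMeansAnalyzer uses, so this is one common choice (the point farthest from the chord joining the ends of the WCSS curve), not necessarily the library's own. The function name find_elbow and the WCSS values are made up for illustration.

```python
import math


def find_elbow(wcss: list[float], k_values: list[int]) -> int:
    """Return the K whose (K, WCSS) point lies farthest from the chord
    joining the first and last points of the curve.

    Hypothetical helper: this is a generic heuristic, not archeo_cluster code.
    """
    if len(wcss) != len(k_values) or len(wcss) < 3:
        raise ValueError("need at least three (K, WCSS) points of equal length")

    x1, y1 = k_values[0], wcss[0]
    x2, y2 = k_values[-1], wcss[-1]
    chord_length = math.hypot(x2 - x1, y2 - y1)

    best_k, best_distance = k_values[0], -1.0
    for k, w in zip(k_values, wcss):
        # Perpendicular distance from the point (k, w) to the chord.
        distance = abs((y2 - y1) * k - (x2 - x1) * w + x2 * y1 - y2 * x1) / chord_length
        if distance > best_distance:
            best_k, best_distance = k, distance
    return best_k


# Illustrative WCSS values for K = 1..6 (invented for this sketch).
k_values = [1, 2, 3, 4, 5, 6]
wcss = [1000.0, 400.0, 150.0, 120.0, 100.0, 90.0]
elbow_k = find_elbow(wcss, k_values)
print(f"Elbow at K = {elbow_k}")
```

The curve drops steeply up to K = 3 and flattens afterwards, which is the shape the elbow method looks for.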
Parameters
csv_path
Path | str
required
Path to the features CSV produced by the detection stage. Must contain image_filename plus all columns listed in feature_columns.
output_dir
Path | str
required
Directory where per-image subdirectories of results are written. Created automatically if it does not exist.
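As a rough illustration of the resulting layout, the helper below lists the paths the method is described as writing for one image. The helper name expected_outputs is hypothetical, and it assumes that <image> in the file names above is the image name as it appears in the CSV.

```python
from pathlib import Path

# Plot file names as listed in the process_features_csv description.
PLOT_NAMES = (
    "elbow_method.png",
    "silhouette_analysis.png",
    "cluster_distribution.png",
    "morphological_scatter.png",
)


def expected_outputs(output_dir: str, image_name: str) -> list[Path]:
    """Build the per-image output paths described above.

    Hypothetical helper: it only constructs paths and does not check
    whether the files exist.
    """
    image_dir = Path(output_dir) / image_name
    clustered_csv = image_dir / f"{image_name}_clustered.csv"
    return [clustered_csv] + [image_dir / name for name in PLOT_NAMES]


paths = expected_outputs("./clusters", "photo")
for path in paths:
    print(path)
```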
images_dir
Path | str | None
default:"None"
Optional path to the directory of processed images saved by the detection stage. When provided, cluster overlay visualizations are written alongside the other outputs.
Returns BatchClusteringResult

cluster_single_image

Perform clustering on a single image’s feature DataFrame without any file I/O.
import pandas as pd

df = pd.read_csv("./output/features.csv")
image_df = df[df["image_filename"] == "photo"].copy()
result = analyzer.cluster_single_image(image_df, image_name="photo")
if result:
    print(f"Optimal K = {result.optimal_k}")
Parameters
df
pd.DataFrame
required
DataFrame containing feature rows for a single image. Must include all columns in feature_columns.
image_name
str
required
Name of the image used in logging and stored in the returned ClusteringResult.
Returns ClusteringResult | None. The result is None if the image has fewer rows than config.min_samples_per_cluster.
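The skip rule can be pictured as a per-image row count check. The sketch below groups CSV rows by image_filename and separates images with enough rows from those that would be skipped. The helper split_by_image and the sample data are hypothetical, and the library's own grouping logic may differ.

```python
import csv
import io
from collections import defaultdict


def split_by_image(csv_text: str, min_samples: int = 2):
    """Group feature rows by image and apply the minimum-sample rule.

    Hypothetical helper mirroring the documented behaviour: images with
    fewer rows than min_samples are not clustered.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["image_filename"]].append(row)

    eligible = {name: rows for name, rows in groups.items() if len(rows) >= min_samples}
    skipped = sorted(name for name in groups if name not in eligible)
    return eligible, skipped


# Tiny invented features table: photo_a has two objects, photo_b has one.
sample = "image_filename,area\nphoto_a,10\nphoto_a,12\nphoto_b,7\n"
eligible, skipped = split_by_image(sample, min_samples=2)
print("clustered:", sorted(eligible), "skipped:", skipped)
```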

ClusteringConfig

Pydantic model holding all clustering parameters.
from archeo_cluster.models import ClusteringConfig

config = ClusteringConfig(
    max_k=10,
    random_state=42,
    min_samples_per_cluster=2,
    compute_silhouette=True,
)
max_k
int
default:"10"
Maximum number of clusters to evaluate in the elbow method. Must be between 2 and 50 inclusive.
random_state
int
default:"42"
Random seed passed to K-Means for reproducible results.
min_samples_per_cluster
int
default:"2"
Minimum number of samples required to attempt clustering on an image. Images with fewer samples are skipped. Must be ≥ 1.
compute_silhouette
bool
default:"True"
When True, silhouette scores are computed as a complementary validation metric alongside the elbow method. Adds computation time but provides an independent K quality signal.
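The documented constraints can be summarised in a small stand-in class. This is a plain dataclass written for illustration, not the library's Pydantic model. The class name ClusteringConfigSketch and the helper is_valid are hypothetical, and the real model would raise its own validation error type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringConfigSketch:
    """Hypothetical stand-in that mirrors the documented fields and limits."""

    max_k: int = 10
    random_state: int = 42
    min_samples_per_cluster: int = 2
    compute_silhouette: bool = True

    def __post_init__(self) -> None:
        # Limits taken from the field descriptions above.
        if not 2 <= self.max_k <= 50:
            raise ValueError("max_k must be between 2 and 50 inclusive")
        if self.min_samples_per_cluster < 1:
            raise ValueError("min_samples_per_cluster must be >= 1")


def is_valid(**kwargs) -> bool:
    """Return True if the keyword arguments satisfy the documented limits."""
    try:
        ClusteringConfigSketch(**kwargs)
    except ValueError:
        return False
    return True


print(is_valid(max_k=10), is_valid(max_k=51), is_valid(min_samples_per_cluster=0))
```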

Return types

ClusteringResult

Result of K-Means clustering on a single image. This is a dataclass.
image_name
str
required
Name of the source image.
optimal_k
int
required
Optimal number of clusters determined by the elbow method.
labels
list[int]
default:"[]"
Cluster assignment (0-based index) for each object row in the input DataFrame.
clusters
list[ClusterInfo]
default:"[]"
Summary statistics for each cluster.
elbow_result
ElbowResult | None
default:"None"
Output from the elbow method analysis.
silhouette_result
SilhouetteResult | None
default:"None"
Output from silhouette score analysis. None when ClusteringConfig.compute_silhouette is False or computation failed.
cluster_count
int
Read-only property. Number of clusters in clusters.

BatchClusteringResult

Aggregated results from clustering all images in a features CSV. This is a dataclass.
results
list[ClusteringResult]
default:"[]"
One ClusteringResult per image that had enough samples to cluster.
image_count
int
Read-only property. Number of images in results.
total_clusters
int
Read-only property. Sum of cluster_count across all results.
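The relationship between the two result types and their read-only properties can be sketched with stand-in dataclasses. The class names ending in Sketch are hypothetical and carry only the fields needed for this illustration. The real classes hold more data.

```python
from dataclasses import dataclass, field


@dataclass
class ClusteringResultSketch:
    """Hypothetical stand-in for ClusteringResult."""

    image_name: str
    optimal_k: int
    clusters: list = field(default_factory=list)

    @property
    def cluster_count(self) -> int:
        # Number of entries in clusters.
        return len(self.clusters)


@dataclass
class BatchClusteringResultSketch:
    """Hypothetical stand-in for BatchClusteringResult."""

    results: list = field(default_factory=list)

    @property
    def image_count(self) -> int:
        # One result per image that was clustered.
        return len(self.results)

    @property
    def total_clusters(self) -> int:
        # Sum of cluster_count across all results.
        return sum(result.cluster_count for result in self.results)


batch = BatchClusteringResultSketch(
    results=[
        ClusteringResultSketch("site_a", optimal_k=3, clusters=["c0", "c1", "c2"]),
        ClusteringResultSketch("site_b", optimal_k=2, clusters=["c0", "c1"]),
    ]
)
print(f"{batch.total_clusters} clusters across {batch.image_count} images")
```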

ClusterInfo

Summary statistics for a single cluster, as stored in ClusteringResult.clusters.

ElbowResult

Result of the elbow method analysis, as stored in ClusteringResult.elbow_result.

SilhouetteResult

Result of silhouette score analysis, as stored in ClusteringResult.silhouette_result. The silhouette coefficient ranges from -1 (samples likely assigned to the wrong cluster) to 1 (dense, well-separated clusters). The silhouette-optimal K is the one that maximises the score.
Silhouette score is undefined for K=1 (only one cluster), so silhouette_scores[0] is always None when K=1 is included in the evaluated range.
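A minimal pure-Python sketch of both ideas follows: computing a mean silhouette coefficient for a tiny one-dimensional dataset, and picking the K with the highest score while ignoring None entries. The function names, the k_values list, and all numbers are invented for illustration. The library presumably relies on an established implementation instead.

```python
def mean_silhouette(points: list[float], labels: list[int]):
    """Mean silhouette coefficient for 1-D points. Returns None for one cluster."""
    clusters = set(labels)
    if len(clusters) < 2:
        return None  # undefined when there is only one cluster

    total = 0.0
    for i, (p, label) in enumerate(zip(points, labels)):
        same = [abs(p - q) for j, (q, l) in enumerate(zip(points, labels)) if l == label and j != i]
        if not same:
            continue  # singleton cluster: contributes 0 by convention
        a = sum(same) / len(same)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(abs(p - q) for q, l in zip(points, labels) if l == c) / labels.count(c)
            for c in clusters
            if c != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def best_k_by_silhouette(k_values: list[int], scores: list):
    """Return the K with the highest score, skipping None entries."""
    candidates = [(score, k) for k, score in zip(k_values, scores) if score is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


# Two tight, well-separated groups give a score close to 1.
score = mean_silhouette([0.0, 1.0, 10.0, 11.0], [0, 0, 1, 1])
# Invented scores for K = 1..4. The K = 1 entry is None, as described above.
chosen_k = best_k_by_silhouette([1, 2, 3, 4], [None, 0.41, 0.63, 0.52])
print(round(score, 3), chosen_k)
```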