KMeansAnalyzer
K-Means clustering analyzer for archaeological feature data.
Constructor
Clustering configuration. Controls the maximum K to evaluate, random seed, minimum samples per cluster, and whether silhouette scores are computed.
Column names from the features CSV to use as input features for clustering. Centroid coordinates are excluded by default because they are spatial, not morphological.
process_features_csv
Process a features CSV file and perform per-image K-Means clustering. This is the primary entry point for the clustering stage.
- Runs the elbow method to find optimal K.
- Optionally runs silhouette analysis for validation.
- Fits K-Means and assigns cluster labels.
- Writes <image>_clustered.csv, elbow_method.png, silhouette_analysis.png, cluster_distribution.png, and morphological_scatter.png to output_dir/<image>/.
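The elbow step can be illustrated with a small, self-contained sketch. This page does not say which elbow criterion the analyzer uses, so the heuristic below (pick the K whose point lies farthest from the straight line joining the first and last points of the inertia curve) is an assumption, and `pick_elbow` is a hypothetical helper rather than part of the library. The inertia values are made up for illustration.

```python
from typing import Sequence


def pick_elbow(k_values: Sequence[int], inertias: Sequence[float]) -> int:
    """Return the K farthest from the chord joining the curve's end points.

    Hypothetical helper: one common elbow heuristic, not the library's code.
    """
    if len(k_values) != len(inertias) or len(k_values) < 3:
        raise ValueError("need at least three (K, inertia) pairs of equal length")

    # Normalise both axes to [0, 1] so the distance does not depend on units.
    k_min, k_max = k_values[0], k_values[-1]
    i_min, i_max = min(inertias), max(inertias)
    if k_max == k_min or i_max == i_min:
        return k_values[0]  # flat curve: there is no elbow to find
    xs = [(k - k_min) / (k_max - k_min) for k in k_values]
    ys = [(i - i_min) / (i_max - i_min) for i in inertias]

    # Perpendicular distance from each point to the line through the end points.
    dx, dy = xs[-1] - xs[0], ys[-1] - ys[0]
    norm = (dx * dx + dy * dy) ** 0.5
    distances = [
        abs(dy * (x - xs[0]) - dx * (y - ys[0])) / norm
        for x, y in zip(xs, ys)
    ]
    return k_values[distances.index(max(distances))]


# Made-up inertia curve: a steep drop up to K=3, then it flattens out.
ks = [1, 2, 3, 4, 5, 6]
inertias = [1000.0, 400.0, 150.0, 140.0, 135.0, 132.0]
best_k = pick_elbow(ks, inertias)
```

With this curve the sketch selects K=3, the point where adding more clusters stops reducing inertia substantially.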
Path to the features CSV produced by the detection stage. Must contain image_filename plus all columns listed in feature_columns.
Directory where per-image subdirectories of results are written. Created automatically if it does not exist.
Optional path to the directory of processed images saved by the detection stage. When provided, cluster overlay visualizations are written alongside the other outputs.
Returns: BatchClusteringResult
cluster_single_image
Perform clustering on a single image’s feature DataFrame without any file I/O.
DataFrame containing feature rows for a single image. Must include all columns in feature_columns.
Name of the image used in logging and stored in the returned ClusteringResult.
Returns: ClusteringResult | None. None is returned if the image has fewer rows than config.min_samples_per_cluster.
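The per-image grouping and the skip rule for small images can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helpers `group_rows_by_image` and `images_to_cluster` are hypothetical, the `area` column is a made-up feature, and the real analyzer works on DataFrames rather than plain dictionaries.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-in for a features CSV; "area" is a made-up feature column.
FEATURES_CSV = """image_filename,area
site_a.png,10.5
site_a.png,12.0
site_a.png,40.2
site_b.png,7.7
"""


def group_rows_by_image(csv_text):
    """Group CSV rows by their image_filename value (hypothetical helper)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["image_filename"]].append(row)
    return dict(groups)


def images_to_cluster(groups, min_samples_per_cluster):
    """Keep only images with at least min_samples_per_cluster rows."""
    return [
        name
        for name, rows in groups.items()
        if len(rows) >= min_samples_per_cluster
    ]


groups = group_rows_by_image(FEATURES_CSV)
eligible = images_to_cluster(groups, min_samples_per_cluster=3)
```

Here `site_a.png` has three rows and stays eligible, while `site_b.png` has a single row and would be skipped.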
ClusteringConfig
Pydantic model holding all clustering parameters.
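The constraints documented for this model (a maximum K between 2 and 50 inclusive, and a minimum sample count of at least 1) can be mirrored in a standard-library stand-in. The sketch below deliberately uses a plain dataclass instead of Pydantic so that it runs without extra dependencies. It is not the real model: the field names `max_k` and `random_seed` are assumptions, and no defaults are shown because this page does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusteringConfigSketch:
    """Stand-in mirroring the documented constraints; not the real Pydantic model."""

    max_k: int                    # hypothetical name: maximum K for the elbow method
    random_seed: int              # hypothetical name: seed passed to K-Means
    min_samples_per_cluster: int  # documented name
    compute_silhouette: bool      # documented name

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges.
        if not 2 <= self.max_k <= 50:
            raise ValueError("max_k must be between 2 and 50 inclusive")
        if self.min_samples_per_cluster < 1:
            raise ValueError("min_samples_per_cluster must be >= 1")


config = ClusteringConfigSketch(
    max_k=10, random_seed=42, min_samples_per_cluster=3, compute_silhouette=True
)

# An out-of-range maximum K is rejected at construction time.
try:
    ClusteringConfigSketch(
        max_k=51, random_seed=42, min_samples_per_cluster=3, compute_silhouette=False
    )
    rejected = False
except ValueError:
    rejected = True
```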
Maximum number of clusters to evaluate in the elbow method. Must be between 2 and 50 inclusive.
Random seed passed to K-Means for reproducible results.
Minimum number of samples required to attempt clustering on an image. Images with fewer samples are skipped. Must be ≥ 1.
When True, silhouette scores are computed as a complementary validation metric alongside the elbow method. This adds computation time but provides an independent signal of K quality.
Return types
ClusteringResult
Result of K-Means clustering on a single image. This is a dataclass.
Name of the source image.
Optimal number of clusters determined by the elbow method.
Cluster assignment (0-based index) for each object row in the input DataFrame.
Summary statistics for each cluster.
Output from the elbow method analysis.
Output from silhouette score analysis. None when ClusteringConfig.compute_silhouette is False or the computation failed.
Read-only property. Number of clusters in clusters.
BatchClusteringResult
Aggregated results from clustering all images in a features CSV. This is a dataclass.
One ClusteringResult per image that had enough samples to cluster.
Read-only property. Number of images in results.
Read-only property. Sum of cluster_count across all results.
ClusterInfo
Summary statistics for a single cluster. See ClusteringResult.clusters above for field descriptions.
ElbowResult
Result of the elbow method analysis. See ClusteringResult.elbow_result above.
SilhouetteResult
Result of silhouette score analysis. The silhouette coefficient ranges from -1 (poor separation) to 1 (perfect separation), with values near 0 indicating overlapping clusters. The optimal K is selected by maximising the score.
See ClusteringResult.silhouette_result above for field descriptions.
The silhouette score is undefined for K=1 (only one cluster), so silhouette_scores[0] is always None when K=1 is included in the evaluated range.
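To make the K=1 case concrete, here is a minimal pure-Python sketch of the mean silhouette coefficient. It is an illustration only, not the analyzer's implementation: `silhouette_score` below is a hypothetical local function, and the points and labels are made up.

```python
from math import dist
from typing import Optional, Sequence

Point = Sequence[float]


def silhouette_score(points: Sequence[Point], labels: Sequence[int]) -> Optional[float]:
    """Mean silhouette coefficient, or None when fewer than two clusters exist."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        return None  # undefined for a single cluster (K=1)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two tight, well-separated groups of made-up 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
two_clusters = silhouette_score(points, [0, 0, 1, 1])
one_cluster = silhouette_score(points, [0, 0, 0, 0])
```

With two well-separated groups the score is close to 1, while assigning every point to the same cluster yields None, which matches the None entry described for K=1.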