Synthetic Data for Cluster Analysis

repliclust is a Python package for generating synthetic datasets with clusters. It lets you generate many different datasets that are geometrically similar, without ever touching low-level parameters such as cluster centroids or covariance matrices.
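For contrast, here is a minimal sketch of the manual, low-level approach that the package is meant to spare you. It does not use repliclust; the function `sample_clusters` and all the numbers are illustrative assumptions, and only axis-aligned (diagonal covariance) Gaussian clusters are handled.

```python
import random


def sample_clusters(centroids, std_devs, points_per_cluster, seed=0):
    """Draw axis-aligned Gaussian clusters from hand-picked low-level parameters.

    Hypothetical helper for illustration only; not part of repliclust.
    """
    rng = random.Random(seed)  # fixed seed so repeated calls give the same data
    points, labels = [], []
    for label, (center, spread) in enumerate(zip(centroids, std_devs)):
        for _ in range(points_per_cluster):
            # One coordinate per axis, each with its own mean and std deviation.
            points.append([rng.gauss(mu, sigma) for mu, sigma in zip(center, spread)])
            labels.append(label)
    return points, labels


# Every number below is a low-level choice that must be made by hand.
centroids = [(0.0, 0.0), (5.0, 5.0), (0.0, 8.0)]
std_devs = [(1.0, 1.0), (0.5, 2.0), (1.5, 0.3)]  # diagonal covariance only
X, y = sample_clusters(centroids, std_devs, points_per_cluster=100)
```

Producing a second dataset that "looks similar" with this approach means choosing a new set of centroids and spreads by hand, which is the step the package aims to remove.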

Features

  • Reproducibly generate clusters with defined geometric characteristics

  • Manage cluster overlaps, shapes, and probability distributions through intuitive, high-level controls

  • Define custom dataset archetypes to power reproducible and informative benchmarks
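To make the idea of an archetype concrete, the following toy sketch shows one way a high-level description could drive reproducible data generation. It is not repliclust's API: `ToyArchetype`, `synthesize`, and every field name are hypothetical, and the sampling scheme (uniform centers in a box, axis-aligned Gaussian clusters) is a simplifying assumption.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyArchetype:
    """Hypothetical high-level description of a family of datasets."""
    n_clusters: int = 3
    dim: int = 2
    points_per_cluster: int = 100
    box_size: float = 20.0          # centers are drawn uniformly from [0, box_size]^dim
    max_aspect_ratio: float = 3.0   # bound on longest / shortest cluster axis


def synthesize(archetype, seed):
    """Sample low-level parameters from the archetype, then sample the points."""
    rng = random.Random(seed)  # the seed fully determines the dataset
    points, labels = [], []
    for label in range(archetype.n_clusters):
        center = [rng.uniform(0.0, archetype.box_size) for _ in range(archetype.dim)]
        # Per-axis std deviations in [1, max_aspect_ratio] keep the ratio bounded.
        spread = [rng.uniform(1.0, archetype.max_aspect_ratio) for _ in range(archetype.dim)]
        for _ in range(archetype.points_per_cluster):
            points.append([rng.gauss(mu, sigma) for mu, sigma in zip(center, spread)])
            labels.append(label)
    return points, labels


# Same archetype, different seeds: different datasets with shared characteristics.
elongated = ToyArchetype(n_clusters=4, max_aspect_ratio=5.0)
X_a, y_a = synthesize(elongated, seed=1)
X_b, y_b = synthesize(elongated, seed=2)
```

In this sketch the user only states high-level properties once; centroids and spreads are sampled internally, and rerunning with the same seed reproduces the same dataset.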

Reference

Check out our paper here.