User Guide

Generating synthetic data with repliclust works differently from other cluster generators you may have used. Our software is based on data set archetypes: high-level geometric descriptions of whole classes of data sets.

More specifically, an archetype represents a probability distribution over mixture models with similar geometry (same number of clusters, overlaps between clusters, cluster probability distributions, …). To generate an individual synthetic data set, repliclust first draws a probabilistic mixture model from the data set archetype. The actual data set is then sampled from this mixture model. The figure below sketches our workflow.
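The two-stage idea (archetype, then mixture model, then data set) can be illustrated with a minimal NumPy sketch. This is not repliclust's API or its actual sampling procedure: the function names `sample_mixture_model` and `sample_data_set`, the `spread` parameter, and the choice of spherical Gaussian clusters are all assumptions made purely for illustration.

```python
import numpy as np


def sample_mixture_model(n_clusters, dim, spread, rng):
    """Stage 1 (hypothetical): draw one concrete mixture model.

    The arguments play the role of a very crude "archetype": they fix the
    overall geometry, while the centers, scales, and weights are random.
    """
    centers = rng.normal(scale=spread, size=(n_clusters, dim))
    scales = rng.uniform(0.5, 1.5, size=n_clusters)  # per-cluster std. dev.
    weights = rng.dirichlet(np.ones(n_clusters))     # cluster probabilities
    return centers, scales, weights


def sample_data_set(model, n_samples, rng):
    """Stage 2 (hypothetical): sample data points and labels from the model."""
    centers, scales, weights = model
    labels = rng.choice(len(weights), size=n_samples, p=weights)
    noise = rng.normal(size=(n_samples, centers.shape[1]))
    X = centers[labels] + scales[labels, None] * noise
    return X, labels


rng = np.random.default_rng(0)
model = sample_mixture_model(n_clusters=3, dim=2, spread=5.0, rng=rng)
X, y = sample_data_set(model, n_samples=200, rng=rng)
```

Calling `sample_mixture_model` repeatedly with the same arguments yields different mixture models that share the same number of clusters and dimensionality, which is the sense in which a single high-level description stands for a whole class of data sets.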


The following brief tutorials demonstrate how to use repliclust. The section Basic Usage will get you started generating your own synthetic data sets within minutes.