Natural Language-Based Synthetic Data Generation for Cluster Analysis#

repliclust is a Python package for generating synthetic datasets with clusters based on high-level descriptions. Instead of manually setting low-level parameters like cluster centroids or covariance matrices, you can simply describe the desired characteristics of your data, and repliclust will automatically generate datasets that match those specifications.

import repliclust as rpl

X, y, _ = rpl.generate("three highly separated oblong clusters in 10D")

Try our demo here!

Key Features#

Generate Data from High-Level Descriptions: Create datasets by specifying scenarios such as “clusters with very different shapes and sizes” or “highly overlapping oblong clusters.”
Data Set Archetypes: Use archetypes to define the overall geometry of your datasets with intuitive parameters that summarize cluster overlaps, shapes, sizes, and distributions.
Integration with Large Language Models (LLMs): Leverage LLMs to map verbal descriptions onto data set archetypes, enabling automated dataset generation from natural language inputs.
Flexible Cluster Shapes: Go beyond convex, blob-like clusters by applying nonlinear transformations, such as random neural networks for distortion or stereographic projections to create directional data.
Reproducible and Informative Benchmarks: Independently manipulate different aspects of the data to create benchmarks that effectively evaluate and compare clustering algorithms under various conditions.

Getting Started#

Installation: Follow the instructions in the Installation Guide to set up repliclust.
User Guide: Learn how to generate datasets from high-level descriptions in the User Guide.
API Reference: Explore the detailed API documentation in the Reference section.
Source Code: Visit our GitHub repository to view the source code and contribute.

Why Use repliclust?#

Simplify Synthetic Data Generation: Eliminate the need to fine-tune low-level simulation parameters. Describe your desired scenario, and let repliclust handle the rest.
Enhance Benchmark Quality: By controlling high-level aspects of the data, you can create more informative benchmarks that reveal the strengths and weaknesses of clustering algorithms under various conditions.
Accelerate Research: Quickly generate diverse datasets to test hypotheses, validate models, and perform robustness checks.

Reference#

For more details, check out our paper: Natural Language-Based Synthetic Data Generation for Cluster Analysis.