Core Framework#

repliclust.base#

Provides the core framework of repliclust.

An Archetype defines the overall geometry of a synthetic data set. Feeding one or several Archetypes into a DataGenerator allows you to sample synthetic data sets with the desired geometries.

Functions:
set_seed()

Set a random seed for reproducibility.

get_supported_distributions()

Obtain a dictionary of supported probability distributions.

Classes:
DataGenerator

Sample synthetic data sets based on data set archetypes.

Archetype

Sample probabilistic mixture models with a desired overall geometric structure.

MixtureModel

Probabilistic mixture model with defined cluster shapes, locations, and probability distributions.

DistributionMix

Mechanism for assigning probability distributions to clusters when sampling a MixtureModel via an Archetype.

SingleClusterDistribution

Define the probability distribution for a single cluster in a MixtureModel.

GroupSizeSampler

Sample the number of data points for each cluster.

ClusterCenterSampler

Sample the locations of cluster centers for a MixtureModel.

CovarianceSampler

Sample cluster shapes for a MixtureModel.

class repliclust.base.Archetype(n_clusters: int, dim: int, n_samples: int = 500, name=None, scale: float = 1.0, covariance_sampler: Optional[CovarianceSampler] = None, center_sampler: Optional[ClusterCenterSampler] = None, groupsize_sampler: Optional[GroupSizeSampler] = None, distribution_mix: Optional[DistributionMix] = None, **kwargs)#

Bases: object

Base class for a data set archetype.

Objects of this class sample probabilistic mixture models by first sampling cluster shapes, then sampling the locations for all cluster centers, and finally assigning a probability distribution to each cluster.

Subclasses implement concrete ways of sampling probabilistic mixture models by providing a wrapper that runs this class’s constructor with certain choices for the covariance_sampler, center_sampler, groupsize_sampler, and distribution_mix parameters. Alternatively, it is possible to directly construct an Archetype object by manually specifying these parameters.

Parameters:
  • n_clusters (int) – The desired number of clusters.

  • dim (int) – The desired number of dimensions.

  • n_samples (int, default=500) – The desired total number of data points.

  • name (str, optional) – The name of this archetype.

  • scale (float, default=1) – The typical length scale for clusters. Increasing this parameter makes all clusters bigger without changing their relative sizes and positions.

  • covariance_sampler (CovarianceSampler, optional) – Sampler for cluster covariances.

  • center_sampler (ClusterCenterSampler, optional) – Sampler for the locations of cluster centers.

  • groupsize_sampler (GroupSizeSampler, optional) – Sampler for the number of data points in each cluster.

  • distribution_mix (DistributionMix, optional) – Assigns probability distributions to clusters.

  • **kwargs (dict, optional) – Extra arguments used by subclasses of Archetype to store additional attributes.

See also

MaxMinArchetype

The default implementation for a data set archetype.

sample_mixture_model(quiet=False)#

Sample a probabilistic mixture model according to this archetype.

Returns:

mixture_model – A probabilistic mixture model with the overall geometric structure specified by this archetype.

Return type:

MixtureModel
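The order of the three sampling steps described for this class (cluster shapes, then cluster centers, then probability distributions) can be sketched without repliclust itself. The sketch below is a minimal illustration under stated assumptions: RecordingSampler and sample_mixture_model_sketch are hypothetical names, the samplers return empty placeholders, and none of this is the library's actual implementation.

```python
class RecordingSampler:
    """Hypothetical stand-in that records when each sampling step runs."""

    def __init__(self, log):
        self.log = log

    def sample_covariances(self, archetype):
        self.log.append("shapes")
        return [], []  # placeholder for (axes_list, axis_lengths_list)

    def sample_cluster_centers(self, archetype):
        self.log.append("centers")
        return []  # placeholder for the matrix of cluster centers

    def assign_distributions(self, n_clusters):
        self.log.append("distributions")
        return []  # placeholder for the list of distributions


def sample_mixture_model_sketch(archetype, sampler, n_clusters):
    """Run the three steps in the documented order and collect the results."""
    axes_list, axis_lengths_list = sampler.sample_covariances(archetype)
    centers = sampler.sample_cluster_centers(archetype)
    distributions = sampler.assign_distributions(n_clusters)
    # Same argument order as the MixtureModel constructor signature.
    return centers, axes_list, axis_lengths_list, distributions


call_log = []
result = sample_mixture_model_sketch(None, RecordingSampler(call_log), n_clusters=3)
print(call_log)  # ['shapes', 'centers', 'distributions']
```

The four returned components line up with the arguments of the MixtureModel constructor documented on this page.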

synthesize(n_samples=None, quiet=False)#

Convenience method to create data from this archetype directly, without having to instantiate a DataGenerator first.

Unlike DataGenerator.synthesize(), this method returns only the data (X, y), because the archetype is already known.

class repliclust.base.ClusterCenterSampler#

Bases: object

Base class for sampling the locations of all cluster centers in a MixtureModel.

Subclasses implement a concrete way of sampling cluster centers by overriding the sample_cluster_centers() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.

sample_cluster_centers(archetype)#

Sample the locations of all clusters in a MixtureModel.

Subclasses overriding this method should follow the call signature below.

Parameters:

archetype (Archetype) – Data set archetype specifying the desired overall geometry of a probabilistic mixture model.

Returns:

centers – A matrix whose i-th row gives the location of the i-th cluster in the mixture model.

Return type:

ndarray
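As an illustration of the call signature above, the following sketch shows a sampler-like class that returns one row per cluster. It is a sketch under assumptions, not repliclust code: UniformBoxCenterSampler is a hypothetical name, the archetype is a stand-in object assumed to expose n_clusters and dim attributes, and nested Python lists are used in place of an ndarray so that the example has no dependencies.

```python
import random
from types import SimpleNamespace


class UniformBoxCenterSampler:
    """Hypothetical sampler: draws each center uniformly from a cube."""

    def __init__(self, half_width=10.0, seed=0):
        self.half_width = half_width
        self._rng = random.Random(seed)  # local generator, for repeatability

    def sample_cluster_centers(self, archetype):
        # One row per cluster, one column per dimension.
        return [
            [self._rng.uniform(-self.half_width, self.half_width)
             for _ in range(archetype.dim)]
            for _ in range(archetype.n_clusters)
        ]


# Stand-in for an Archetype; only the two attributes used above are provided.
toy_archetype = SimpleNamespace(n_clusters=4, dim=3)
centers = UniformBoxCenterSampler().sample_cluster_centers(toy_archetype)
print(len(centers), len(centers[0]))  # 4 3
```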

class repliclust.base.CovarianceSampler#

Bases: object

Base class for sampling the shapes of all clusters in a MixtureModel.

Subclasses implement a concrete way of sampling cluster shapes by overriding the sample_covariances() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.

sample_covariances(archetype)#

Sample cluster shapes for all clusters in a MixtureModel.

Subclasses overriding this method should follow the call signature below.

Parameters:

archetype (Archetype) – Data set archetype specifying the desired overall geometry of a probabilistic mixture model.

Returns:

(axes_list, axis_lengths_list) – A tuple with two components. The first component, axes_list, is a list whose i-th entry stores the principal axes of cluster i as a matrix (each row is an axis). The second component, axis_lengths_list, is a list whose i-th entry stores the lengths of the i-th cluster's principal axes as a vector (the j-th entry is the length of the principal axis stored in the j-th row of axes_list[i]).

Return type:

tuple[list[numpy.ndarray], list[numpy.ndarray]]
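The layout of the returned tuple can be illustrated with a small sketch. The class below is hypothetical and is not part of repliclust: it assumes the archetype exposes n_clusters, dim, and scale attributes, makes every cluster spherical, and uses nested lists in place of ndarrays.

```python
from types import SimpleNamespace


class SphericalCovarianceSampler:
    """Hypothetical sampler: every cluster is a sphere of radius `scale`."""

    def sample_covariances(self, archetype):
        dim, n_clusters = archetype.dim, archetype.n_clusters
        axes_list = []
        axis_lengths_list = []
        for _ in range(n_clusters):
            # Rows of the identity matrix serve as orthonormal principal axes.
            axes = [[1.0 if r == c else 0.0 for c in range(dim)]
                    for r in range(dim)]
            axes_list.append(axes)
            # Entry j is the length of the axis stored in row j of `axes`.
            axis_lengths_list.append([archetype.scale] * dim)
        return axes_list, axis_lengths_list


# Stand-in for an Archetype with only the attributes used above.
toy_archetype = SimpleNamespace(n_clusters=2, dim=3, scale=1.5)
axes_list, axis_lengths_list = SphericalCovarianceSampler().sample_covariances(
    toy_archetype
)
print(axes_list[0][1], axis_lengths_list[0][1])  # [0.0, 1.0, 0.0] 1.5
```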

class repliclust.base.DataGenerator(archetype, n_datasets=10, quiet=False, prefix='archetype')#

Bases: object

Data generator based on data set archetypes.

Base class for a data generator. Instances of this class generate synthetic data sets based on archetypes indicating their desired geometries.

There are three different ways to generate synthetic data sets with a DataGenerator. After constructing a DataGenerator dg, you can write:

  1. X, y, archetype = dg.synthesize(n_samples)

    Generate a single data set with the desired number of samples.

  2. for X, y, archetype in dg: ...

    Iterate over dg and generate dg._n_datasets datasets, each with the number of samples specified by the corresponding archetype.

  3. for X, y, archetype in dg(n_datasets, n_samples): ...

    Iterate over dg and generate n_datasets datasets, each with n_samples data points if n_samples is a number; if n_samples is a list of n_datasets numbers, the i-th dataset will have n_samples[i] data points. If either n_datasets or n_samples is not specified, use n_datasets = dg._n_datasets and the number of data points specified by each archetype.

In each case, the output format is as follows: X is a matrix-shaped ndarray containing the data points (samples by variables) and y is a vector-shaped ndarray containing the cluster labels. Finally, archetype is the data set archetype from which the data set was generated.
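The rules for n_datasets and n_samples in the third calling style can be sketched as a small helper. This is an interpretation under assumptions rather than the library's code: resolve_sample_counts is a hypothetical function, each missing argument is assumed to fall back to its own default independently, and a single archetype default (archetype_n_samples) stands in for the per-archetype values.

```python
def resolve_sample_counts(n_datasets=None, n_samples=None,
                          default_n_datasets=10, archetype_n_samples=500):
    """Return the number of data points for each data set to be generated."""
    if n_datasets is None:
        n_datasets = default_n_datasets  # plays the role of dg._n_datasets
    if n_samples is None:
        # Fall back to the sample count stored on the archetype.
        return [archetype_n_samples] * n_datasets
    if isinstance(n_samples, int):
        # A single number applies to every data set.
        return [n_samples] * n_datasets
    counts = list(n_samples)
    if len(counts) != n_datasets:
        raise ValueError("n_samples must contain exactly n_datasets numbers")
    return counts


print(resolve_sample_counts(3, 100))       # [100, 100, 100]
print(resolve_sample_counts(2, [50, 70]))  # [50, 70]
```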

Parameters:

archetype (Archetype or list[Archetype] or dict[str, Archetype]) – One or several archetypes specifying the desired overall geometry of synthetic data sets.

synthesize(n_samples=None, quiet=None)#

Synthesize a data set according to the specified archetype(s). If this DataGenerator consists of more than one archetype, this function cycles through the given archetypes.

Parameters:
  • n_samples (int, optional) – Desired total number of data points to sample. If specified, overrides the number of samples specified by the archetype.

  • quiet (bool) – If True, suppress all print output. This option is useful when making many successive calls to synthesize.

Returns:

(X, y, archetype) – Tuple with three components. The first component, X, stores the new data set as a matrix (each row is a data point). The second component, y, stores the cluster labels (y[i] is the label of data point X[i,:]). The third component, archetype, is the data set archetype that was used to create X and y.

Return type:

tuple[ndarray, ndarray, Archetype]
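The cycling behavior described for synthesize() can be pictured with itertools.cycle. The sketch below is hypothetical and models only the choice of archetype: CyclingGeneratorSketch and the archetype names are made up for illustration, and starting from the first archetype is an assumption.

```python
from itertools import cycle


class CyclingGeneratorSketch:
    """Hypothetical stand-in that hands out archetypes in round-robin order."""

    def __init__(self, archetypes):
        self._archetype_cycle = cycle(archetypes)

    def synthesize(self):
        # The real method would also sample X and y; only the choice of
        # archetype is modeled here.
        return next(self._archetype_cycle)


generator = CyclingGeneratorSketch(["oblong", "spherical", "overlapping"])
used = [generator.synthesize() for _ in range(5)]
print(used)  # ['oblong', 'spherical', 'overlapping', 'oblong', 'spherical']
```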

class repliclust.base.DistributionMix#

Bases: object

Base class for assigning probability distributions to all clusters in a MixtureModel.

Subclasses implement a concrete assignment mechanism by overriding the assign_distributions() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.

assign_distributions(n_clusters)#

Assign probability distributions to all clusters in a MixtureModel.

Subclasses overriding this method should follow the call signature below.

Parameters:

n_clusters (int) – The number of clusters in the mixture model.

Returns:

distributions – A list whose i-th element represents the probability distribution assigned to the i-th cluster.

Return type:

list[SingleClusterDistribution]
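A minimal sketch of an assignment mechanism that follows this call signature is shown below. RoundRobinMix is a hypothetical class, and plain strings stand in for SingleClusterDistribution objects; the round-robin rule is chosen only because it is easy to verify by hand.

```python
class RoundRobinMix:
    """Hypothetical mix: assign the given distributions in rotation."""

    def __init__(self, distributions):
        self.distributions = list(distributions)

    def assign_distributions(self, n_clusters):
        # Element i of the result is the distribution for cluster i.
        return [
            self.distributions[i % len(self.distributions)]
            for i in range(n_clusters)
        ]


assigned = RoundRobinMix(["normal", "exponential"]).assign_distributions(5)
print(assigned)  # ['normal', 'exponential', 'normal', 'exponential', 'normal']
```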

class repliclust.base.GroupSizeSampler#

Bases: object

Base class for sampling the number of data points for each cluster in a MixtureModel.

Subclasses implement a concrete way of sampling group sizes by overriding the sample_group_sizes() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.

sample_group_sizes(archetype, total)#

Sample the number of data points for each cluster in a MixtureModel.

Subclasses overriding this method should follow the call signature below.

Parameters:
  • archetype (Archetype) – Data set archetype specifying the desired overall geometry of a probabilistic mixture model.

  • total (int) – The total number of samples (sum of all group sizes).

Returns:

group_sizes – A vector whose i-th entry is the number of data points for the i-th cluster.

Return type:

ndarray
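The contract above (one entry per cluster, entries summing to total) can be illustrated with a balanced split. BalancedGroupSizes is a hypothetical class, the archetype stand-in is assumed to expose n_clusters, and a list is returned in place of an ndarray.

```python
from types import SimpleNamespace


class BalancedGroupSizes:
    """Hypothetical sampler: split `total` as evenly as possible."""

    def sample_group_sizes(self, archetype, total):
        base, remainder = divmod(total, archetype.n_clusters)
        # The first `remainder` clusters receive one extra data point.
        return [
            base + 1 if i < remainder else base
            for i in range(archetype.n_clusters)
        ]


toy_archetype = SimpleNamespace(n_clusters=3)
group_sizes = BalancedGroupSizes().sample_group_sizes(toy_archetype, total=500)
print(group_sizes, sum(group_sizes))  # [167, 167, 166] 500
```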

class repliclust.base.MixtureModel(centers, axes_list, axis_lengths_list, distributions_list)#

Bases: object

Represents a probabilistic mixture model from which you can draw samples.

Parameters:
  • centers (ndarray) – The locations of the cluster centers in this mixture model, arranged as a matrix. The i-th row of this matrix stores the i-th cluster center.

  • axes_list (list[ndarray]) – A list of the principal axes of each cluster. The i-th element is a matrix whose rows are the orthonormal axes of the i-th cluster.

  • axis_lengths_list (list[ndarray]) – A list containing the lengths of the principal axes of each cluster. The i-th element is a vector whose j-th entry is the length of the j-th principal axis of cluster i.

  • distributions_list (list[SingleClusterDistribution]) – A list assigning a probability distribution to each cluster in this mixture model. The i-th element is the probability distribution of the i-th cluster.

sample_data(group_sizes)#

Sample a data set from this MixtureModel.

Parameters:

group_sizes (ndarray) – The number of data points to sample for each cluster, formatted as a vector whose length is the number of clusters in this MixtureModel.

Returns:

(X, y) – Tuple with two components. The first component, X, is a matrix that stores the sampled data points (the i-th row is the i-th data point), while the second component, y, is a vector that stores the cluster labels as integers ranging from zero to the number of clusters minus one.

Return type:

tuple[ndarray, ndarray]
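The relationship between group_sizes and the label vector y can be sketched as follows. labels_from_group_sizes is a hypothetical helper; whether the library returns data points grouped by cluster, as here, or in some other order is an assumption.

```python
def labels_from_group_sizes(group_sizes):
    """Build integer labels 0..k-1, repeating label i group_sizes[i] times."""
    labels = []
    for cluster_index, size in enumerate(group_sizes):
        labels.extend([cluster_index] * size)
    return labels


y = labels_from_group_sizes([2, 0, 3])
print(y)  # [0, 0, 2, 2, 2]
```

The length of y equals the sum of group_sizes, which is also the number of rows of X.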

class repliclust.base.SingleClusterDistribution(**params)#

Bases: object

Base class for specifying the probability distribution of a single cluster in a MixtureModel.

Subclasses implement a probability distribution by overriding the _sample_1d() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.

See also

MultivariateNormal

Multivariate normal probability distribution for a single cluster.

Exponential

Exponential probability distribution for a single cluster.

DistributionFromNumPy

Arbitrary probability distribution from numpy for a single cluster.

sample_cluster(n: int, center: ndarray, axes: ndarray, axis_lengths: ndarray)#

Sample data points for a single cluster.

Parameters:
  • n (int) – The number of data points to generate.

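  • center (ndarray) – The cluster center.

  • axes (ndarray) – The principal axes of the cluster, arranged as a matrix whose rows are the axes.

  • axis_lengths (ndarray) – The lengths of the cluster's principal axes; the j-th entry is the length of the axis stored in the j-th row of axes.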

Returns:

X – Data points for a single cluster, arranged as a matrix with n rows (each row is a single data point).

Return type:

ndarray
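One plausible way to combine center, axes, and axis_lengths into data points is sketched below. It is not the library's algorithm: sample_cluster_sketch is a hypothetical function that draws a standard normal coordinate along each principal axis, scales it by that axis's length, and shifts the result by the center. Nested lists stand in for ndarrays.

```python
import random


def sample_cluster_sketch(n, center, axes, axis_lengths, seed=0):
    """Return n points: center + sum_j (z_j * axis_lengths[j]) * axes[j]."""
    rng = random.Random(seed)
    dim = len(center)
    points = []
    for _ in range(n):
        # Coordinate along each principal axis, scaled by that axis's length.
        coords = [rng.gauss(0.0, 1.0) * length for length in axis_lengths]
        point = [
            center[d] + sum(coords[j] * axes[j][d] for j in range(len(axes)))
            for d in range(dim)
        ]
        points.append(point)
    return points


# A degenerate cluster: spread along the first axis only (second length is 0).
X = sample_cluster_sketch(
    n=5,
    center=[1.0, 2.0],
    axes=[[1.0, 0.0], [0.0, 1.0]],
    axis_lengths=[3.0, 0.0],
)
print(len(X), all(point[1] == 2.0 for point in X))  # 5 True
```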

repliclust.base.get_supported_distributions()#

Get a dictionary of the currently supported probability distributions, as well as their default parameters. The names agree with the method names of the numpy.random.Generator class.

repliclust.base.set_seed(seed)#

Set a program-wide seed for repliclust.

Parameters:

seed (int) – Random seed.