Core Framework
repliclust.base
Provides the core framework of repliclust.
An Archetype defines the overall geometry of a synthetic data set. Feeding one or several Archetypes into a DataGenerator allows you to sample synthetic data sets with the desired geometries.
- Functions:
set_seed()
Set a random seed for reproducibility.
get_supported_distributions()
Obtain a dictionary of supported probability distributions.
- Classes:
DataGenerator
Sample synthetic data sets based on data set archetypes.
Archetype
Sample probabilistic mixture models with a desired overall geometric structure.
MixtureModel
Probabilistic mixture model with defined cluster shapes, locations, and probability distributions.
DistributionMix
Mechanism for assigning probability distributions to clusters when sampling a MixtureModel via an Archetype.
SingleClusterDistribution
Define the probability distribution for a single cluster in a MixtureModel.
GroupSizeSampler
Sample the number of data points for each cluster.
ClusterCenterSampler
Sample the locations of cluster centers for a MixtureModel.
CovarianceSampler
Sample cluster shapes for a MixtureModel.
- class repliclust.base.Archetype(n_clusters: int, dim: int, n_samples: int = 500, name=None, scale: float = 1.0, covariance_sampler: Optional[CovarianceSampler] = None, center_sampler: Optional[ClusterCenterSampler] = None, groupsize_sampler: Optional[GroupSizeSampler] = None, distribution_mix: Optional[DistributionMix] = None, **kwargs)
Bases: object
Base class for a data set archetype.
Objects of this class sample probabilistic mixture models by first sampling cluster shapes, then sampling the locations for all cluster centers, and finally assigning a probability distribution to each cluster.
Subclasses implement concrete ways of sampling probabilistic mixture models by providing a wrapper that runs this class’s constructor with certain choices for the covariance_sampler, center_sampler, groupsize_sampler, and distribution_mix parameters. Alternatively, it is possible to directly construct an Archetype object by manually specifying these parameters.
- Parameters:
n_clusters (int) – The desired number of clusters.
dim (int) – The desired number of dimensions.
n_samples (int, default=500) – The desired total number of data points.
name (str, optional) – The name of this archetype.
scale (float, default=1.0) – The typical length scale for clusters. Increasing this parameter makes all clusters bigger without changing their relative sizes and positions.
covariance_sampler (CovarianceSampler) – Sampler for cluster covariances.
center_sampler (ClusterCenterSampler) – Sampler for the locations of cluster centers.
groupsize_sampler (GroupSizeSampler) – Sampler for the number of data points in each cluster.
distribution_mix (DistributionMix) – Assigns probability distributions to clusters.
**kwargs (dict, optional) – Extra arguments used by subclasses of Archetype to store additional attributes.
See also
MaxMinArchetype: The default implementation for a dataset archetype.
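The sampling order described above (cluster shapes first, then center locations, then one distribution per cluster) can be sketched with plain-Python stand-ins. Everything in this sketch is hypothetical: ToyArchetype, toy_shapes, toy_centers, and toy_distributions are illustrative names that are not part of repliclust, and the sketch only mirrors the composition pattern of plugging interchangeable samplers into one object.

```python
import random


def toy_shapes(n_clusters, dim, scale, rng):
    # Hypothetical stand-in for a covariance sampler: one list of
    # axis lengths per cluster, all close to the length scale.
    return [[scale * rng.uniform(0.5, 1.5) for _ in range(dim)]
            for _ in range(n_clusters)]


def toy_centers(n_clusters, dim, scale, rng):
    # Hypothetical stand-in for a center sampler: uniform in a box
    # whose side grows with the number of clusters.
    side = 4.0 * scale * n_clusters
    return [[rng.uniform(0.0, side) for _ in range(dim)]
            for _ in range(n_clusters)]


def toy_distributions(n_clusters):
    # Hypothetical stand-in for a distribution mix.
    return ["normal"] * n_clusters


class ToyArchetype:
    """Toy composition of samplers; not repliclust's Archetype."""

    def __init__(self, n_clusters, dim, scale=1.0, seed=0):
        self.n_clusters, self.dim, self.scale = n_clusters, dim, scale
        self.rng = random.Random(seed)

    def sample_mixture_model(self):
        # Step 1: shapes, step 2: centers, step 3: distributions.
        shapes = toy_shapes(self.n_clusters, self.dim, self.scale, self.rng)
        centers = toy_centers(self.n_clusters, self.dim, self.scale, self.rng)
        dists = toy_distributions(self.n_clusters)
        return {"centers": centers, "axis_lengths": shapes,
                "distributions": dists}


model = ToyArchetype(n_clusters=3, dim=2).sample_mixture_model()
print(len(model["centers"]), len(model["axis_lengths"]))
```

Swapping any of the three toy functions for another implementation changes the sampled geometry without touching the class itself, which is the point of passing samplers as parameters.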
- sample_mixture_model(quiet=False)
Sample a probabilistic mixture model according to this archetype.
- Returns:
mixture_model – A probabilistic mixture model with the overall geometric structure specified by this archetype.
- Return type:
MixtureModel
- synthesize(n_samples=None, quiet=False)
Convenience method to create data directly from this archetype, without having to instantiate a DataGenerator first.
Unlike DataGenerator.synthesize(), which returns (X, y, archetype), this method returns only the data (X, y), since the archetype is already known.
- class repliclust.base.ClusterCenterSampler
Bases: object
Base class for sampling the locations of all cluster centers in a MixtureModel.
Subclasses implement a concrete way of sampling cluster centers by overriding the sample_cluster_centers() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.
- sample_cluster_centers(archetype)
Sample the locations of all clusters in a MixtureModel.
Subclasses overriding this method should follow the call signature below.
- class repliclust.base.CovarianceSampler
Bases: object
Base class for sampling the shapes of all clusters in a MixtureModel.
Subclasses implement a concrete way of sampling cluster shapes by overriding the sample_covariances() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.
- sample_covariances(archetype)
Sample cluster shapes for all clusters in a MixtureModel.
Subclasses overriding this method should follow the call signature below.
- Parameters:
archetype (Archetype) – Data set archetype specifying the desired overall geometry of a probabilistic mixture model.
- Returns:
(axes_list, axis_lengths_list) – A tuple with two components. The first component, axes_list, is a list whose i-th entry stores the principal axes of cluster i as a matrix (each row is an axis). The second component, axis_lengths_list, is a list whose i-th entry stores the lengths of the i-th cluster's principal axes as a vector (the j-th entry is the length of the principal axis stored in the j-th row of axes_list[i]).
- Return type:
tuple[list[numpy.ndarray], list[numpy.ndarray]]
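To make the return format concrete, the sketch below builds axes_list and axis_lengths_list for two-dimensional clusters using only the standard library. It uses nested lists in place of numpy arrays. The function name toy_sample_covariances, the rotation-based construction, and the range of axis lengths are assumptions chosen for illustration, not a description of how repliclust samples shapes.

```python
import math
import random


def toy_sample_covariances(n_clusters, seed=0):
    """Return (axes_list, axis_lengths_list) for 2D clusters."""
    rng = random.Random(seed)
    axes_list, axis_lengths_list = [], []
    for _ in range(n_clusters):
        theta = rng.uniform(0.0, math.pi)
        # Each row is one principal axis; the two rows are orthonormal.
        axes = [[math.cos(theta), math.sin(theta)],
                [-math.sin(theta), math.cos(theta)]]
        # lengths[j] is the length of the axis stored in row j of axes.
        lengths = [rng.uniform(0.5, 2.0), rng.uniform(0.5, 2.0)]
        axes_list.append(axes)
        axis_lengths_list.append(lengths)
    return axes_list, axis_lengths_list


axes_list, axis_lengths_list = toy_sample_covariances(n_clusters=3)
first, second = axes_list[0]
dot = first[0] * second[0] + first[1] * second[1]
print(len(axes_list), len(axis_lengths_list), abs(dot) < 1e-12)
```

Both lists have one entry per cluster, and the index i that selects a cluster's axes in axes_list selects the matching lengths in axis_lengths_list.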
- class repliclust.base.DataGenerator(archetype, n_datasets=10, quiet=False, prefix='archetype')
Bases: object
Data generator based on data set archetypes.
Base class for a data generator. Instances of this class generate synthetic data sets based on archetypes indicating their desired geometries.
There are three different ways to generate synthetic data sets with a DataGenerator. After constructing a DataGenerator dg, you can write:
X, y, archetype = dg.synthesize(n_samples)
Generate a single data set with the desired number of samples.
for X, y, archetype in dg: ...
Iterate over dg and generate dg._n_datasets datasets, each with the number of samples specified by the corresponding archetype.
for X, y, archetype in dg(n_datasets, n_samples): ...
Iterate over dg and generate n_datasets datasets, each with n_samples data points if n_samples is a number; if n_samples is a list of n_datasets numbers, the i-th dataset will have n_samples[i] data points. If n_datasets is not specified, dg._n_datasets is used; if n_samples is not specified, each dataset has the number of data points specified by its archetype.
In each case, the output format is as follows: X is a matrix-shaped ndarray containing the data points (samples by variables) and y is a vector-shaped ndarray containing the cluster labels. Finally, archetype is the data set archetype from which the data set was generated.
- Parameters:
archetype (Archetype or list[Archetype] or dict[str, Archetype]) – One or several archetypes specifying the desired overall geometry of synthetic data sets.
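The three call styles listed above can be imitated with a small stand-in class. ToyGenerator below is hypothetical and is not repliclust's DataGenerator: archetypes are reduced to (name, default_n_samples) pairs, the data are placeholders, and details such as whether the real class restarts its cycle for each loop are not stated in this reference, so the sketch simply keeps one running cycle.

```python
import itertools


class ToyGenerator:
    """Toy stand-in that mimics the three call styles; not repliclust."""

    def __init__(self, archetypes, n_datasets=10):
        # Each archetype is a (name, default_n_samples) pair here.
        self._archetypes = list(archetypes)
        self._cycle = itertools.cycle(self._archetypes)
        self._n_datasets = n_datasets

    def synthesize(self, n_samples=None):
        # Take the next archetype in the cycle; n_samples overrides
        # the archetype's own default when given.
        name, default_n = next(self._cycle)
        n = default_n if n_samples is None else n_samples
        X = [[float(i), 0.0] for i in range(n)]  # placeholder data
        y = [0] * n                              # placeholder labels
        return X, y, name

    def __iter__(self):
        return self()

    def __call__(self, n_datasets=None, n_samples=None):
        count = self._n_datasets if n_datasets is None else n_datasets
        if n_samples is None or isinstance(n_samples, int):
            sizes = [n_samples] * count
        else:
            sizes = list(n_samples)
            if len(sizes) != count:
                raise ValueError("need one sample count per data set")
        for n in sizes:
            yield self.synthesize(n)


dg = ToyGenerator([("a", 5), ("b", 8)], n_datasets=4)
single = dg.synthesize(3)
by_default = [(len(X), name) for X, y, name in dg]
by_list = [(len(X), name) for X, y, name in dg(2, [6, 7])]
print(single[2], by_default, by_list)
```

In this toy, a single integer for n_samples applies to every data set, a list supplies one count per data set, and omitting it falls back to each archetype's own default.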
- synthesize(n_samples=None, quiet=None)
Synthesize a data set according to the specified archetype(s). If this DataGenerator consists of more than one archetype, this function cycles through the given archetypes.
- Parameters:
n_samples (int, optional) – Desired total number of data points to sample. If specified, overrides the number of samples specified by the archetype.
quiet (bool) – If true, suppress all print output. This option is useful when placing many successive calls to synthesize.
- Returns:
(X, y, archetype) – Tuple with three components. The first component, X, stores the new data set as a matrix (each row is a data point). The second component, y, stores the cluster labels (y[i] is the label of data point X[i,:]). The third component, archetype, is the data set archetype that was used to create X and y.
- Return type:
- class repliclust.base.DistributionMix
Bases: object
Base class for assigning probability distributions to all clusters in a MixtureModel.
Subclasses implement a concrete assignment mechanism by overriding the assign_distributions() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.
- assign_distributions(n_clusters)
Assign probability distributions to all clusters in a MixtureModel.
Subclasses overriding this method should follow the call signature below.
- Parameters:
n_clusters (int) – The number of clusters in the mixture model.
- Returns:
distributions – A list whose i-th element represents the probability distribution assigned to the i-th cluster.
- Return type:
list[SingleClusterDistribution]
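As a rough illustration of the contract (one distribution per cluster, returned as a list of length n_clusters), the sketch below draws a distribution name for each cluster according to relative weights. ToyDistribution and ToyWeightedMix are hypothetical names; they do not inherit from or reproduce any repliclust class, and the weighted random draw is just one possible assignment mechanism.

```python
import random


class ToyDistribution:
    """Hypothetical placeholder for a single-cluster distribution."""

    def __init__(self, name):
        self.name = name


class ToyWeightedMix:
    """Toy assignment mechanism; not a repliclust class."""

    def __init__(self, weights_by_name, seed=0):
        # weights_by_name maps a distribution name to a relative weight.
        self.names = list(weights_by_name)
        self.weights = [weights_by_name[name] for name in self.names]
        self.rng = random.Random(seed)

    def assign_distributions(self, n_clusters):
        # Draw one distribution name per cluster, then wrap each name.
        chosen = self.rng.choices(self.names, weights=self.weights,
                                  k=n_clusters)
        return [ToyDistribution(name) for name in chosen]


mix = ToyWeightedMix({"normal": 3, "exponential": 1})
distributions = mix.assign_distributions(n_clusters=5)
print([d.name for d in distributions])
```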
- class repliclust.base.GroupSizeSampler
Bases: object
Base class for sampling the number of data points for each cluster in a MixtureModel.
Subclasses implement a concrete way of sampling group sizes by overriding the sample_group_sizes() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.
- sample_group_sizes(archetype, total)
Sample the number of data points for each cluster in a MixtureModel.
Subclasses overriding this method should follow the call signature below.
- Parameters:
archetype (Archetype) – Data set archetype specifying the desired overall geometry of a probabilistic mixture model.
total (int) – The total number of samples (sum of all group sizes).
- Returns:
group_sizes – A vector whose i-th entry is the number of data points for the i-th cluster.
- Return type:
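The key constraint on the output is that the group sizes are whole numbers that add up to total. One simple way to satisfy it, shown here as an assumption rather than as repliclust's method, is to split total in proportion to per-cluster weights and hand out the rounding leftovers by largest remainder. The helper name toy_group_sizes and its weights argument are hypothetical; the documented method takes an archetype instead.

```python
def toy_group_sizes(weights, total):
    """Split `total` points across clusters in proportion to `weights`."""
    weight_sum = sum(weights)
    exact = [total * w / weight_sum for w in weights]
    sizes = [int(x) for x in exact]  # round down first
    # Hand the leftover points to the largest fractional remainders.
    leftover = total - sum(sizes)
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


group_sizes = toy_group_sizes(weights=[1, 2, 2], total=11)
print(group_sizes, sum(group_sizes))
```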
- class repliclust.base.MixtureModel(centers, axes_list, axis_lengths_list, distributions_list)
Bases: object
Represents a probabilistic mixture model from which you can draw samples.
- Parameters:
centers (ndarray) – The locations of the cluster centers in this mixture model, arranged as a matrix. The i-th row of this matrix stores the i-th cluster center.
axes_list (list[ndarray]) – A list of the principal axes of each cluster. The i-th element is a matrix whose rows are the orthonormal axes of the i-th cluster.
axis_lengths_list (list[ndarray]) – A list containing the lengths of the principal axes of each cluster. The i-th element is a vector whose j-th entry is the length of the j-th principal axis of cluster i.
distributions_list (list[SingleClusterDistribution]) – A list assigning a probability distribution to each cluster in this mixture model. The i-th element is the probability distribution of the i-th cluster.
- sample_data(group_sizes)
Sample a data set from this MixtureModel.
- Parameters:
group_sizes (ndarray) – The number of data points to sample for each cluster, formatted as a vector whose length is the number of clusters in this MixtureModel.
- Returns:
(X, y) – Tuple with two components. The first component, X, is a matrix that stores the sampled data points (the i-th row is the i-th data point), while the second component, y, is a vector that stores the cluster labels as integers ranging from zero to the number of clusters minus one.
- Return type:
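The relationship between group_sizes and the returned (X, y) pair can be sketched in a few lines of standard-library Python. The function toy_sample_data is hypothetical, it uses nested lists instead of ndarray objects, and it makes two simplifying assumptions that this reference does not confirm for repliclust: noise is spherical Gaussian, and rows are emitted grouped by cluster rather than shuffled.

```python
import random


def toy_sample_data(centers, group_sizes, seed=0):
    """Draw group_sizes[i] points around centers[i]; return (X, y)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, (center, n_points) in enumerate(zip(centers, group_sizes)):
        for _ in range(n_points):
            # Spherical Gaussian noise is a simplifying assumption.
            X.append([c + rng.gauss(0.0, 1.0) for c in center])
            y.append(label)
    return X, y


X, y = toy_sample_data(centers=[[0.0, 0.0], [10.0, 10.0]],
                       group_sizes=[3, 2])
print(len(X), y)
```

The number of rows in X equals the sum of group_sizes, and label i appears exactly group_sizes[i] times in y.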
- class repliclust.base.SingleClusterDistribution(**params)
Bases: object
Base class for specifying the probability distribution of a single cluster in a MixtureModel.
Subclasses implement a probability distribution by overriding the _sample_1d() method, which specifies a call signature that should be followed. By contrast, subclasses define their own attributes without restriction.
See also
MultivariateNormal
Multivariate normal probability distribution for a single cluster.
Exponential
Exponential probability distribution for a single cluster.
DistributionFromNumPy
Arbitrary probability distribution from numpy for a single cluster.
- repliclust.base.get_supported_distributions()
Get a dictionary of the currently supported probability distributions, as well as their default parameters. The names agree with the method names of numpy.random.Generator.
- repliclust.base.set_seed(seed)
Set a program-wide seed for repliclust.
- Parameters:
seed (int) – Random seed.