Max-Min Implementation#

repliclust.maxmin#

This module provides functionality for implementing data set archetypes based on max-min sampling.

repliclust.maxmin.archetype#

This module implements an archetype for mixture models. The user chooses the desired geometry by setting the ratios between the largest and smallest values of various geometric parameters.

class repliclust.maxmin.archetype.MaxMinArchetype(n_clusters=6, dim=2, n_samples=500, max_overlap=0.05, min_overlap=0.001, imbalance_ratio=2, aspect_maxmin=2, radius_maxmin=3, aspect_ref=1.5, name=None, scale=1.0, packing=0.1, distributions=['normal', 'exponential'], distribution_proportions=None, overlap_mode='auto', linear_penalty_weight=0.01, learning_rate='auto')#

Bases: Archetype

A dataset archetype that defines the overall geometry using max-min ratios.

The user sets the ratios between largest and smallest values of various geometric parameters.

Parameters:
  • n_clusters (int) – The desired number of clusters.

  • dim (int) – The desired number of dimensions.

  • n_samples (int) – Total number of samples in the dataset.

  • max_overlap (float, optional) – Maximum allowed overlap between any two clusters, as a fraction between 0 and 1. Default is 0.05.

  • min_overlap (float, optional) – Minimum required overlap between a cluster and some other cluster, as a fraction between 0 and 1. Default is 1e-3.

  • imbalance_ratio (float, optional) – Ratio between the largest and smallest group sizes among clusters. Must be >= 1. Default is 2.

  • aspect_maxmin (float, optional) – Ratio between the largest and smallest aspect ratios among clusters. Must be >= 1. Default is 2.

  • radius_maxmin (float, optional) – Ratio between the largest and smallest radii among clusters. Must be >= 1. Default is 3.

  • aspect_ref (float, optional) – Reference aspect ratio for clusters. Must be >= 1. Default is 1.5.

  • name (str, optional) – Name of the archetype. If None, a default name is assigned.

  • scale (float, optional) – Reference length scale for generated data. Default is 1.0.

  • packing (float, optional) – Packing density parameter affecting cluster placement. Default is 0.1.

  • distributions (list of str or tuple, optional) – Selection of probability distributions for the clusters. Default is ['normal', 'exponential'].

  • distribution_proportions (list of float, optional) – Proportions of clusters that should have each distribution.

  • overlap_mode ({'auto', 'lda', 'c2c'}, optional) – Degree of precision when computing cluster overlaps: 'lda' is more exact than 'c2c' but more computationally expensive. Default is 'auto', which automatically switches from 'lda' to 'c2c'.

  • linear_penalty_weight (float, optional) – Weight of the linear penalty in the overlap optimization. Default is 0.01.

  • learning_rate (float or 'auto', optional) – Learning rate for overlap optimization. If ‘auto’, it is set based on the dimensionality. Default is ‘auto’.

Notes

Glossary of geometric terms:

  • Group size : Number of data points in a cluster.

  • Cluster radius : Geometric mean of the standard deviations along a cluster’s principal axes.

  • Cluster aspect ratio : Ratio between the lengths of the longest and shortest principal axes of a cluster.
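As an illustration of these definitions, the following minimal sketch computes the radius and aspect ratio of a toy cluster. The helper names and the numbers are hypothetical and not part of repliclust; the sketch also assumes that the length of a principal axis is the standard deviation along that axis.

```python
import math


def cluster_radius(axis_std_devs):
    """Geometric mean of the standard deviations along the principal axes."""
    log_mean = sum(math.log(s) for s in axis_std_devs) / len(axis_std_devs)
    return math.exp(log_mean)


def cluster_aspect_ratio(axis_std_devs):
    """Ratio between the longest and the shortest principal axis."""
    return max(axis_std_devs) / min(axis_std_devs)


# Toy cluster (hypothetical numbers): three principal axes, 120 data points.
axis_std_devs = [4.0, 2.0, 1.0]
group_size = 120

radius = cluster_radius(axis_std_devs)        # cube root of 4 * 2 * 1
aspect = cluster_aspect_ratio(axis_std_devs)  # 4 / 1
```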

Examples

Create an archetype with default parameters:

>>> archetype = MaxMinArchetype()

Create an archetype with specific parameters:

>>> archetype = MaxMinArchetype(n_clusters=10, dim=5, aspect_ref=2.0)

create_another_like_this(suffix=None, **new_params)#

Create a new archetype similar to this one with updated parameters.

Parameters:
  • suffix (str, optional) – Suffix to append to the new archetype’s name.

  • **new_params – Arbitrary keyword arguments representing parameters to update.

Returns:

A new MaxMinArchetype instance with updated parameters.

Return type:

MaxMinArchetype

describe(exclude_internal_params=True)#

Get a dictionary describing the archetype’s parameters.

Returns:

A dictionary containing the archetype’s parameters.

Return type:

dict

edit_params(suffix=None, **params)#

Create a modified copy of this archetype with updated parameters.

Parameters:
  • suffix (str, optional) – Suffix to append to the archetype’s name. If None, the suffix ‘edited’ is used.

  • **params – Arbitrary keyword arguments representing parameters to update.

Returns:

A new MaxMinArchetype instance with updated parameters.

Return type:

MaxMinArchetype

static from_verbal_description(description: str, name=None, openai_api_key=None)#

Instantiate a MaxMinArchetype from a verbal description.

Parameters:
  • description (str) – Verbal description of the desired dataset archetype.

  • name (str, optional) – Name to assign to the new archetype. If None, a name is generated.

  • openai_api_key (str, optional) – OpenAI API key for accessing the language model. If None, the key is read from the configuration.

Returns:

A new MaxMinArchetype instance based on the verbal description.

Return type:

MaxMinArchetype

Raises:

Exception – If the OpenAI client cannot be initialized or the archetype cannot be created.

Notes

This method uses a language model to parse the verbal description and generate the archetype parameters.

guess_learning_rate(dim)#

Estimate an appropriate learning rate based on the dimensionality.

Parameters:

dim (int) – The dimensionality of the data.

Returns:

Suggested learning rate.

Return type:

float

sample_hyperparams(n=10, min_n_clusters=1, max_n_clusters=30, min_samples_per_cluster=10, max_samples_per_cluster=1000, min_dim=2, max_dim=50)#

Generate multiple copies of this archetype by sampling hyperparameters.

Parameters:
  • n (int, optional) – Number of archetype copies to generate. Default is 10.

  • min_n_clusters (int, optional) – Minimum number of clusters. Default is 1.

  • max_n_clusters (int, optional) – Maximum number of clusters. Default is 30.

  • min_samples_per_cluster (int, optional) – Minimum number of samples per cluster. Default is 10.

  • max_samples_per_cluster (int, optional) – Maximum number of samples per cluster. Default is 1000.

  • min_dim (int, optional) – Minimum dimensionality. Default is 2.

  • max_dim (int, optional) – Maximum dimensionality. Default is 50.

Returns:

A list of new MaxMinArchetype instances with sampled hyperparameters.

Return type:

list of MaxMinArchetype

Notes

The hyperparameters n_clusters, n_samples, and dim are sampled from Poisson distributions centered around the current archetype’s parameters.
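The note above can be sketched roughly as follows. This is not the library's implementation: the helper names are hypothetical, the Poisson sampler is Knuth's textbook method from the standard library's building blocks, and clipping each draw to the minimum and maximum is an assumption, since the documentation does not say how the bounds are enforced.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_clipped(center, low, high, rng):
    """Poisson draw centered on `center`, clipped to [low, high] (assumption)."""
    return min(max(sample_poisson(center, rng), low), high)


rng = random.Random(0)

# Centers mirror the default archetype (n_clusters=6, dim=2); the bounds
# mirror the defaults of sample_hyperparams.
sampled = [
    {
        "n_clusters": sample_clipped(6, 1, 30, rng),
        "dim": sample_clipped(2, 2, 50, rng),
    }
    for _ in range(10)
]
```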

repliclust.maxmin.archetype.parse_distribution_selection(distributions: list, proportions=None)#

Parse user selection of probability distributions.

Reformats the user-provided list of distributions and proportions into a format suitable for constructing a FixedProportionMix object.

Parameters:
  • distributions (list of str or tuple) – Selection of probability distributions to include in each mixture model. Each element is either:

    • A string representing the name of the distribution.

    • A tuple (name, params_dict), where name is the distribution name and params_dict is a dictionary of distribution parameters.

  • proportions (list of float, optional) – Proportions of clusters that should have each distribution. If None, distributions are equally weighted.

Returns:

A list suitable for constructing a FixedProportionMix object, where each element is a tuple (name, proportion, params_dict).

Return type:

list of tuple

Raises:

ValueError – If distributions is not in the expected format.

Notes

To print all valid distribution names, call repliclust.print_supported_distributions().
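The reformatting described above might look roughly like the sketch below. The function parse_selection_sketch is a hypothetical stand-in, not the library's code. It assumes that a bare name gets an empty parameter dictionary and that proportions must match the distributions one to one; 'some_distribution' and 'some_param' are placeholders rather than supported names.

```python
def parse_selection_sketch(distributions, proportions=None):
    """Return a list of (name, proportion, params_dict) tuples."""
    if not distributions:
        raise ValueError("distributions must be a non-empty list")
    if proportions is None:
        # Equal weighting when no proportions are given.
        proportions = [1.0 / len(distributions)] * len(distributions)
    if len(proportions) != len(distributions):
        raise ValueError("need exactly one proportion per distribution")

    parsed = []
    for entry, proportion in zip(distributions, proportions):
        if isinstance(entry, str):
            # Bare name: assume an empty parameter dictionary.
            name, params = entry, {}
        elif (
            isinstance(entry, tuple)
            and len(entry) == 2
            and isinstance(entry[0], str)
            and isinstance(entry[1], dict)
        ):
            name, params = entry
        else:
            raise ValueError(f"unexpected distribution entry: {entry!r}")
        parsed.append((name, proportion, params))
    return parsed


# Placeholder names; they are not taken from the list of supported distributions.
result = parse_selection_sketch(
    ["normal", ("some_distribution", {"some_param": 1.0})]
)
```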

repliclust.maxmin.archetype.validate_archetype_args(**args)#

Validate all provided arguments for a MaxMinArchetype.

This function checks the validity of overlap parameters, max-min ratios, and reference quantities.

Parameters:

**args – Arbitrary keyword arguments containing parameters to validate.

Raises:

ValueError – If any of the parameters fail their respective validations.

repliclust.maxmin.archetype.validate_maxmin_ratios(maxmin_ratio=2, arg_name='aspect_maxmin', underlying_param='aspect ratio')#

Check that a max-min ratio is >= 1.

repliclust.maxmin.archetype.validate_overlaps(max_overlap=0.05, min_overlap=0.001)#

Check the overlap parameters. Note that max_overlap=1 and min_overlap=0 are allowed; these values should have the effect of removing one or both overlap constraints.

repliclust.maxmin.archetype.validate_reference_quantity(ref_qty=1.5, min_allowed_value=1, name='aspect_ref')#

Check that a reference value is at least its minimum allowed value.
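The three checks above can be approximated by the stand-in functions below. Their names, error messages, and exact conditions are assumptions for illustration; in particular, requiring min_overlap to be strictly below max_overlap is a guess that the documentation does not state.

```python
def check_maxmin_ratio(maxmin_ratio=2, arg_name="aspect_maxmin"):
    """Raise ValueError unless the max-min ratio is >= 1."""
    if maxmin_ratio < 1:
        raise ValueError(f"{arg_name} must be >= 1, got {maxmin_ratio}")


def check_overlaps(max_overlap=0.05, min_overlap=0.001):
    """Raise ValueError unless both overlaps are fractions in [0, 1]."""
    for name, value in (("max_overlap", max_overlap), ("min_overlap", min_overlap)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    # Assumption: the minimum overlap must stay below the maximum overlap.
    if min_overlap >= max_overlap:
        raise ValueError("min_overlap must be smaller than max_overlap")


def check_reference_quantity(ref_qty=1.5, min_allowed_value=1, name="aspect_ref"):
    """Raise ValueError if the reference value is below its allowed minimum."""
    if ref_qty < min_allowed_value:
        raise ValueError(f"{name} must be >= {min_allowed_value}, got {ref_qty}")


def rejects(check, *args):
    """Return True if calling check(*args) raises ValueError."""
    try:
        check(*args)
    except ValueError:
        return True
    return False


# The boundary values are accepted, matching the note on validate_overlaps.
check_overlaps(max_overlap=1, min_overlap=0)
```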

repliclust.maxmin.covariance#

This module provides a class for sampling cluster covariances using the max-min approach.

class repliclust.maxmin.covariance.MaxMinCovarianceSampler(aspect_ref=1.5, aspect_maxmin=2, radius_maxmin=2)#

Bases: object

Sample covariances for the clusters in a mixture model by specifying the max-min ratios for various geometric parameters.

See documentation of class MaxMinArchetype for more information.

aspect_ref#

Reference aspect ratio for clusters in the mixture model.

Type:

float, >= 1

aspect_maxmin#

Max-min ratio for the aspect ratios of clusters in the mixture model.

Type:

float, >= 1

radius_maxmin#

Max-min ratio for the radii of clusters in the mixture model.

Type:

float, >= 1

make_axis_lengths(n_axes, reference_length, aspect_ratio)#

Sample the lengths of all principal axes for a single cluster.

Parameters:
  • n_axes (int) – The number of principal axes (same as the dimensionality).

  • reference_length (float) – Desired geometric mean of the lengths.

  • aspect_ratio (float) – Desired ratio between longest and shortest lengths.

Returns:

lengths – Lengths of the principal axes for this cluster.

Return type:

ndarray
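To make the two constraints concrete, the sketch below builds one deterministic set of axis lengths with a given geometric mean and aspect ratio. It is an illustration only: the actual method samples the lengths, whereas this hypothetical helper simply spaces them evenly on a log scale and returns a plain list instead of an ndarray.

```python
import math


def log_spaced_axis_lengths(n_axes, reference_length, aspect_ratio):
    """Lengths with geometric mean reference_length and max/min = aspect_ratio."""
    if n_axes == 1:
        return [reference_length]
    # Log-lengths are symmetric around log(reference_length), so their mean
    # is log(reference_length) and the geometric mean is reference_length.
    half_span = 0.5 * math.log(aspect_ratio)
    step = 2 * half_span / (n_axes - 1)
    return [
        reference_length * math.exp(-half_span + i * step)
        for i in range(n_axes)
    ]


lengths = log_spaced_axis_lengths(n_axes=3, reference_length=2.0, aspect_ratio=4.0)

# Verify both constraints numerically.
geometric_mean = math.exp(sum(math.log(x) for x in lengths) / len(lengths))
ratio = max(lengths) / min(lengths)
```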

make_cluster_aspect_ratios(n_clusters)#

Sample aspect ratios for all clusters.

The aspect ratio of a cluster measures how oblong/ellipsoidal the cluster is. It is defined as the ratio between the lengths of the cluster’s longest and shortest principal axes.

Parameters:

n_clusters (int) – The number of clusters.

Returns:

out – The aspect ratios for each cluster.

Return type:

ndarray

make_cluster_radii(n_clusters, ref_radius, dim)#

Sample cluster radii using pairwise max-min sampling.

Sampling constrains the arithmetic mean of cluster volumes to equal the reference volume (namely, ref_radius**dim). The minimum and maximum cluster radii of the resulting sample average to the reference radius.

Parameters:
  • n_clusters (int) – The number of clusters.

  • ref_radius (float) – The reference radius for the clusters.

  • dim (int) – The number of dimensions.

Returns:

radii – Radii for all the clusters.

Return type:

ndarray

sample_covariances(archetype)#

Compute the principal axes and their lengths for each cluster in a mixture model.

Parameters:

archetype (Archetype) – Archetype for a mixture model.

Returns:

(axes_list, axis_lengths_list) – Tuple with two components. The first component, axes_list, is a list whose i-th element stores the principal axes of the i-th cluster as a matrix (each row is an axis). The second component, axis_lengths_list, is a list whose i-th element stores the lengths of the i-th cluster’s principal axes as a vector. In particular, for any cluster i and axis j, the number axis_lengths_list[i][j] is the length corresponding to the principal axis axes_list[i][j,:].

Return type:

Tuple[List[ndarray], List[ndarray]]
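The layout of the return value can be illustrated with hand-written toy data. The sketch below uses nested lists in place of ndarrays and a hypothetical helper, covariance_from_axes. It assumes that the axes are orthonormal and that each axis length is the standard deviation along its axis, so that a covariance matrix is the sum of length**2 times the outer product of each axis with itself.

```python
import math


def covariance_from_axes(axes, axis_lengths):
    """Sum of length_j**2 * outer(axis_j, axis_j), as nested lists."""
    dim = len(axes[0])
    cov = [[0.0] * dim for _ in range(dim)]
    for axis, length in zip(axes, axis_lengths):
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += length ** 2 * axis[r] * axis[c]
    return cov


# Toy stand-in for the return value with two 2-D clusters (hypothetical numbers).
s = math.sqrt(0.5)
axes_list = [
    [[1.0, 0.0], [0.0, 1.0]],  # cluster 0: axis-aligned
    [[s, s], [-s, s]],         # cluster 1: rotated by 45 degrees
]
axis_lengths_list = [[2.0, 1.0], [3.0, 1.0]]

# axis_lengths_list[i][j] is the length of the axis stored in row axes_list[i][j].
covariances = [
    covariance_from_axes(axes, lengths)
    for axes, lengths in zip(axes_list, axis_lengths_list)
]
```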

validate_k(n_clusters)#

Make sure the number of clusters is valid.

repliclust.maxmin.groupsizes#

This module provides functionality for sampling the number of data points in each cluster using a max-min approach.

class repliclust.maxmin.groupsizes.MaxMinGroupSizeSampler(imbalance_ratio=2)#

Bases: GroupSizeSampler

Sample the number of data points in each cluster using pairwise max-min sampling.

imbalance_ratio#

The desired ratio between largest and smallest group size.

Type:

float, >= 1

__init__(self, imbalance_ratio)#

make_group_sizes(self, clusterdata)#

sample_group_sizes(archetype, total)#

Sample the number of data points for each cluster using pairwise max-min sampling.

Parameters:
  • archetype (Archetype) – Archetype for a mixture model.

  • total (int) – The total number of data points (sum of group sizes).

Returns:

group_sizes – The number of data points for each cluster.

Return type:

ndarray
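One way to inspect a returned sample is to compare its sum and its largest-to-smallest ratio with the requested total and imbalance_ratio. The sketch below does this for hand-picked group sizes. The helper summarize_group_sizes and the numbers are hypothetical, and the real sampler is not guaranteed to hit the desired ratio exactly.

```python
def summarize_group_sizes(group_sizes):
    """Return (total number of points, ratio of largest to smallest group)."""
    return sum(group_sizes), max(group_sizes) / min(group_sizes)


# Hand-picked sizes for six clusters (not produced by the sampler).
group_sizes = [60, 80, 100, 120, 80, 60]
total, imbalance = summarize_group_sizes(group_sizes)
```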