Max-Min Implementation#
repliclust.maxmin#
This module provides functionality for implementing data set archetypes based on max-min sampling.
repliclust.maxmin.archetype#
This module implements an archetype for mixture models. The user chooses the desired geometry by setting the ratios between the largest and smallest values of various geometric parameters.
- class repliclust.maxmin.archetype.MaxMinArchetype(n_clusters=6, dim=2, n_samples=500, max_overlap=0.05, min_overlap=0.001, imbalance_ratio=2, aspect_maxmin=2, radius_maxmin=3, aspect_ref=1.5, name=None, scale=1.0, packing=0.1, distributions=['normal', 'exponential'], distribution_proportions=None, overlap_mode='auto', linear_penalty_weight=0.01, learning_rate='auto')#
Bases:
Archetype
A dataset archetype that defines the overall geometry using max-min ratios.
The user sets the ratios between largest and smallest values of various geometric parameters.
- Parameters:
n_clusters (int) – The desired number of clusters.
dim (int) – The desired number of dimensions.
n_samples (int) – Total number of samples in the dataset.
max_overlap (float, optional) – Maximum allowed overlap between any two clusters, as a fraction between 0 and 1. Default is 0.05.
min_overlap (float, optional) – Minimum required overlap between a cluster and some other cluster, as a fraction between 0 and 1. Default is 1e-3.
imbalance_ratio (float, optional) – Ratio between the largest and smallest group sizes among clusters. Must be >= 1. Default is 2.
aspect_maxmin (float, optional) – Ratio between the largest and smallest aspect ratios among clusters. Must be >= 1. Default is 2.
radius_maxmin (float, optional) – Ratio between the largest and smallest radii among clusters. Must be >= 1. Default is 3.
aspect_ref (float, optional) – Reference aspect ratio for clusters. Must be >= 1. Default is 1.5.
name (str, optional) – Name of the archetype. If None, a default name is assigned.
scale (float, optional) – Reference length scale for generated data. Default is 1.0.
packing (float, optional) – Packing density parameter affecting cluster placement. Default is 0.1.
distributions (list of str or tuple, optional) – Selection of probability distributions for the clusters. Default is [‘normal’, ‘exponential’].
distribution_proportions (list of float, optional) – Proportions of clusters that should have each distribution.
overlap_mode ({'auto', 'lda', 'c2c'}, optional) – Degree of precision when computing cluster overlaps: ‘lda’ is more exact than ‘c2c’ but more computationally expensive. Default is ‘auto’, which automatically switches from ‘lda’ to ‘c2c’.
linear_penalty_weight (float, optional) – Weight of the linear penalty in the overlap optimization. Default is 0.01.
learning_rate (float or 'auto', optional) – Learning rate for overlap optimization. If ‘auto’, it is set based on the dimensionality. Default is ‘auto’.
Notes
Glossary of geometric terms:
Group size : Number of data points in a cluster.
Cluster radius : Geometric mean of the standard deviations along a cluster’s principal axes.
Cluster aspect ratio : Ratio between the lengths of the longest and shortest principal axes of a cluster.
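The glossary terms above can be made concrete with a small sketch. The helper names below (cluster_radius, cluster_aspect_ratio, maxmin_ratio) are hypothetical and not part of repliclust; they only restate the definitions in plain Python.

```python
import math


def cluster_radius(axis_stds):
    """Geometric mean of the standard deviations along the principal axes."""
    return math.exp(sum(math.log(s) for s in axis_stds) / len(axis_stds))


def cluster_aspect_ratio(axis_stds):
    """Ratio between the longest and shortest principal axis."""
    return max(axis_stds) / min(axis_stds)


def maxmin_ratio(values):
    """Ratio between the largest and smallest value in a collection."""
    return max(values) / min(values)


# One cluster with standard deviations 4 and 1 along its two principal axes.
axis_stds = [4.0, 1.0]
radius = cluster_radius(axis_stds)        # geometric mean of 4 and 1
aspect = cluster_aspect_ratio(axis_stds)  # 4 / 1

# Three clusters with 50, 75, and 100 data points.
group_sizes = [50, 75, 100]
imbalance = maxmin_ratio(group_sizes)     # 100 / 50
```

In these terms, imbalance_ratio, aspect_maxmin, and radius_maxmin are the max-min ratios of group sizes, aspect ratios, and radii taken across all clusters.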
Examples
Create an archetype with default parameters:
>>> archetype = MaxMinArchetype()
Create an archetype with specific parameters:
>>> archetype = MaxMinArchetype(n_clusters=10, dim=5, aspect_ref=2.0)
- create_another_like_this(suffix=None, **new_params)#
Create a new archetype similar to this one with updated parameters.
- Parameters:
suffix (str, optional) – Suffix to append to the new archetype’s name.
**new_params – Arbitrary keyword arguments representing parameters to update.
- Returns:
A new MaxMinArchetype instance with updated parameters.
- Return type:
MaxMinArchetype
- describe(exclude_internal_params=True)#
Get a dictionary describing the archetype’s parameters.
- Returns:
A dictionary containing the archetype’s parameters.
- Return type:
dict
- edit_params(suffix=None, **params)#
Create a modified copy of this archetype with updated parameters.
- Parameters:
suffix (str, optional) – Suffix to append to the archetype’s name. If None, the suffix ‘edited’ is used.
**params – Arbitrary keyword arguments representing parameters to update.
- Returns:
A new MaxMinArchetype instance with updated parameters.
- Return type:
MaxMinArchetype
- static from_verbal_description(description: str, name=None, openai_api_key=None)#
Instantiate a MaxMinArchetype from a verbal description.
- Parameters:
description (str) – Verbal description of the desired dataset archetype.
name (str, optional) – Name to assign to the new archetype. If None, a name is generated.
openai_api_key (str, optional) – OpenAI API key for accessing the language model. If None, the key is read from the configuration.
- Returns:
A new MaxMinArchetype instance based on the verbal description.
- Return type:
MaxMinArchetype
- Raises:
Exception – If the OpenAI client cannot be initialized or the archetype cannot be created.
Notes
This method uses a language model to parse the verbal description and generate the archetype parameters.
- guess_learning_rate(dim)#
Estimate an appropriate learning rate based on the dimensionality.
- Parameters:
dim (int) – The dimensionality of the data.
- Returns:
Suggested learning rate.
- Return type:
float
- sample_hyperparams(n=10, min_n_clusters=1, max_n_clusters=30, min_samples_per_cluster=10, max_samples_per_cluster=1000, min_dim=2, max_dim=50)#
Generate multiple copies of this archetype by sampling hyperparameters.
- Parameters:
n (int, optional) – Number of archetype copies to generate. Default is 10.
min_n_clusters (int, optional) – Minimum number of clusters. Default is 1.
max_n_clusters (int, optional) – Maximum number of clusters. Default is 30.
min_samples_per_cluster (int, optional) – Minimum number of samples per cluster. Default is 10.
max_samples_per_cluster (int, optional) – Maximum number of samples per cluster. Default is 1000.
min_dim (int, optional) – Minimum dimensionality. Default is 2.
max_dim (int, optional) – Maximum dimensionality. Default is 50.
- Returns:
A list of new MaxMinArchetype instances with sampled hyperparameters.
- Return type:
list of MaxMinArchetype
Notes
The hyperparameters n_clusters, n_samples, and dim are sampled from Poisson distributions centered around the current archetype’s parameters.
- repliclust.maxmin.archetype.parse_distribution_selection(distributions: list, proportions=None)#
Parse user selection of probability distributions.
Reformats the user-provided list of distributions and proportions into a format suitable for constructing a FixedProportionMix object.
- Parameters:
distributions (list of str or tuple) – Selection of probability distributions to include in each mixture model. Each element is either: - A string representing the name of the distribution. - A tuple (name, params_dict), where name is the distribution name and params_dict is a dictionary of distribution parameters.
proportions (list of float, optional) – Proportions of clusters that should have each distribution. If None, distributions are equally weighted.
- Returns:
A list suitable for constructing a FixedProportionMix object, where each element is a tuple (name, proportion, params_dict).
- Return type:
list of tuple
- Raises:
ValueError – If distributions is not in the expected format.
Notes
To print all valid distribution names, call repliclust.print_supported_distributions().
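The reformatting described above can be sketched roughly as follows. This is a hypothetical re-implementation for illustration only (parse_selection_sketch is not a repliclust function), and the parameter key "scale" in the usage example is an assumed placeholder rather than a confirmed distribution parameter.

```python
def parse_selection_sketch(distributions, proportions=None):
    """Turn a list of names or (name, params_dict) tuples into
    (name, proportion, params_dict) tuples."""
    if not isinstance(distributions, list) or len(distributions) == 0:
        raise ValueError("distributions must be a non-empty list")
    if proportions is None:
        # Assumption: equal weighting means each entry gets 1 / (number of entries).
        proportions = [1.0 / len(distributions)] * len(distributions)
    if len(proportions) != len(distributions):
        raise ValueError("need exactly one proportion per distribution")

    parsed = []
    for entry, proportion in zip(distributions, proportions):
        if isinstance(entry, str):
            name, params = entry, {}
        elif (isinstance(entry, tuple) and len(entry) == 2
              and isinstance(entry[0], str) and isinstance(entry[1], dict)):
            name, params = entry
        else:
            raise ValueError(f"unexpected distribution entry: {entry!r}")
        parsed.append((name, proportion, params))
    return parsed


# A bare name mixed with a (name, params_dict) tuple; "scale" is illustrative.
selection = parse_selection_sketch(["normal", ("exponential", {"scale": 1.0})])
```

Here selection holds one (name, proportion, params_dict) tuple per requested distribution, with both proportions equal because none were given.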
- repliclust.maxmin.archetype.validate_archetype_args(**args)#
Validate all provided arguments for a MaxMinArchetype.
This function checks the validity of overlap parameters, max-min ratios, and reference quantities.
- Parameters:
**args – Arbitrary keyword arguments containing parameters to validate.
- Raises:
ValueError – If any of the parameters fail their respective validations.
- repliclust.maxmin.archetype.validate_maxmin_ratios(maxmin_ratio=2, arg_name='aspect_maxmin', underlying_param='aspect ratio')#
Check that a max-min ratio is >= 1.
- repliclust.maxmin.archetype.validate_overlaps(max_overlap=0.05, min_overlap=0.001)#
Note that we allow max_overlap=1 and min_overlap=0, which should have the effect of removing one or both overlap constraints.
- repliclust.maxmin.archetype.validate_reference_quantity(ref_qty=1.5, min_allowed_value=1, name='aspect_ref')#
Check that a reference value exceeds its minimum allowed value.
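A minimal sketch of what these three checks might look like is given below. The function names are hypothetical stand-ins, and two details are assumptions rather than documented behaviour: that min_overlap must be strictly smaller than max_overlap, and that a reference quantity equal to its minimum allowed value is accepted (consistent with "Must be >= 1" for aspect_ref).

```python
def check_maxmin_ratio(maxmin_ratio=2, arg_name="aspect_maxmin",
                       underlying_param="aspect ratio"):
    """Reject max-min ratios below 1."""
    if maxmin_ratio < 1:
        raise ValueError(
            f"{arg_name} must be >= 1: it is the ratio between the largest "
            f"and smallest {underlying_param}"
        )


def check_overlaps(max_overlap=0.05, min_overlap=0.001):
    """Require 0 <= min_overlap < max_overlap <= 1 (strictness is assumed)."""
    if not (0 <= min_overlap < max_overlap <= 1):
        raise ValueError("need 0 <= min_overlap < max_overlap <= 1")


def check_reference_quantity(ref_qty=1.5, min_allowed_value=1,
                             name="aspect_ref"):
    """Reject reference values below the minimum allowed value."""
    if ref_qty < min_allowed_value:
        raise ValueError(f"{name} must be >= {min_allowed_value}")


# The boundary values mentioned above pass without raising.
check_overlaps(max_overlap=1.0, min_overlap=0.0)
check_maxmin_ratio(1.0)
check_reference_quantity(1.0)
```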
repliclust.maxmin.covariance#
This module provides a class for sampling cluster covariances using the max-min approach.
- class repliclust.maxmin.covariance.MaxMinCovarianceSampler(aspect_ref=1.5, aspect_maxmin=2, radius_maxmin=2)#
Bases:
object
Sample covariances for the clusters in a mixture model by specifying the max-min ratios for various geometric parameters.
See documentation of class MaxMinArchetype for more information.
- aspect_ref#
Reference aspect ratio for clusters in the mixture model.
- Type:
float, >= 1
- aspect_maxmin#
Max-min ratio for the aspect ratios of clusters in the mixture model.
- Type:
float, >= 1
- radius_maxmin#
Max-min ratio for the radii of clusters in the mixture model.
- Type:
float, >= 1
- make_axis_lengths(n_axes, reference_length, aspect_ratio)#
Sample the lengths of all principal axes for a single cluster.
- Parameters:
n_axes (int) – The number of principal axes (same as the dimensionality).
reference_length (float) – Desired geometric mean of the lengths.
aspect_ratio (float) – Desired ratio between longest and shortest lengths.
- Returns:
lengths – Lengths of the principal axes for this cluster.
- Return type:
ndarray
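To illustrate the two constraints (geometric mean and aspect ratio), here is a deterministic stand-in, not the library's sampling procedure. The hypothetical axis_lengths_sketch spaces the log-lengths evenly and symmetrically around the log of the reference length, which satisfies both constraints exactly; the actual method samples the lengths at random.

```python
import math


def axis_lengths_sketch(n_axes, reference_length, aspect_ratio):
    """Return n_axes lengths whose geometric mean is reference_length and
    whose max/min ratio is aspect_ratio (deterministic illustration)."""
    if n_axes == 1:
        return [reference_length]
    half_log = 0.5 * math.log(aspect_ratio)
    lengths = []
    for i in range(n_axes):
        t = i / (n_axes - 1)                      # runs from 0 to 1
        log_offset = -half_log + 2.0 * half_log * t
        # Offsets are symmetric around zero, so they cancel in the geometric mean.
        lengths.append(reference_length * math.exp(log_offset))
    return lengths


lengths = axis_lengths_sketch(n_axes=3, reference_length=2.0, aspect_ratio=4.0)
geometric_mean = math.exp(sum(math.log(x) for x in lengths) / len(lengths))
```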
- make_cluster_aspect_ratios(n_clusters)#
Sample aspect ratios for all clusters.
The aspect ratio of a cluster measures how oblong/ellipsoidal the cluster is. It is defined as the ratio between the lengths of the cluster’s longest and shortest principal axes.
- Parameters:
n_clusters (int) – The number of clusters.
- Returns:
out – The aspect ratios for each cluster.
- Return type:
ndarray
- make_cluster_radii(n_clusters, ref_radius, dim)#
Sample cluster radii using pairwise max-min sampling.
Sampling constrains the arithmetic mean of cluster volumes to equal the reference volume (namely ref_radius**dim). The minimum and maximum cluster radii of the resulting sample average to the reference radius.
- Parameters:
n_clusters (int) – The number of clusters.
ref_radius (float) – The reference radius for the clusters.
dim (int) – The number of dimensions.
- Returns:
radii – Radii for all the clusters.
- Return type:
ndarray
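One way to meet the volume constraint with a pairwise scheme is sketched below. This is an assumption-laden illustration, not the library's algorithm: the hypothetical cluster_radii_sketch takes radius_maxmin as an explicit argument, draws volumes in pairs that mirror each other around the reference volume, and enforces only the mean-volume constraint and the max-min ratio of radii.

```python
import random


def cluster_radii_sketch(n_clusters, ref_radius, dim, radius_maxmin, rng=None):
    """Draw radii whose volumes average to ref_radius**dim and whose
    max/min ratio equals radius_maxmin (for n_clusters >= 2)."""
    rng = rng or random.Random()
    ref_volume = ref_radius ** dim
    if n_clusters == 1:
        return [ref_radius]

    # Extreme volumes: they sum to 2 * ref_volume and differ by radius_maxmin**dim.
    min_volume = 2.0 * ref_volume / (1.0 + radius_maxmin ** dim)
    max_volume = 2.0 * ref_volume - min_volume
    volumes = [min_volume, max_volume]

    n_remaining = n_clusters - 2
    for _ in range(n_remaining // 2):
        v = rng.uniform(min_volume, max_volume)
        # The mirrored partner keeps the pair's mean volume at ref_volume.
        volumes.extend([v, 2.0 * ref_volume - v])
    if n_remaining % 2 == 1:
        volumes.append(ref_volume)  # an unpaired cluster gets the reference volume

    return [v ** (1.0 / dim) for v in volumes]


radii = cluster_radii_sketch(n_clusters=5, ref_radius=1.0, dim=2,
                             radius_maxmin=3.0, rng=random.Random(0))
mean_volume = sum(r ** 2 for r in radii) / len(radii)
```

Because every mirrored partner stays between the two extreme volumes, the first pair always holds the smallest and largest radii.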
- sample_covariances(archetype)#
Compute the principal axes and their lengths for each cluster in a mixture model.
- Parameters:
archetype (Archetype) – Archetype for a mixture model.
- Returns:
(axes_list, axis_lengths_list) – Tuple with two components. The first component, axes_list, is a list whose i-th element stores the principal axes of the i-th cluster as a matrix (each row is an axis). The second component, axis_lengths_list, is a list whose i-th element stores the lengths of the i-th cluster’s principal axes as a vector. In particular, for any cluster i and axis j, the number axis_lengths_list[i][j] is the length corresponding to the principal axis axes_list[i][j,:].
- Return type:
Tuple[List[ndarray], List[ndarray]]
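The returned pair can be turned into a covariance matrix per cluster. The sketch below assumes that the rows of each axes matrix are orthonormal and that axis lengths are standard deviations along those axes (as the glossary's definition of cluster radius suggests); covariance_from_axes is a hypothetical helper written with plain lists in place of ndarrays.

```python
import math


def covariance_from_axes(axes, axis_lengths):
    """Build sum_j length_j**2 * outer(axis_j, axis_j) for one cluster."""
    dim = len(axes[0])
    cov = [[0.0] * dim for _ in range(dim)]
    for axis, length in zip(axes, axis_lengths):
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += (length ** 2) * axis[a] * axis[b]
    return cov


# A 2D cluster rotated by 45 degrees; each row of `axes` is one principal axis.
c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
axes = [[c, s], [-s, c]]
axis_lengths = [2.0, 1.0]   # axis_lengths[j] belongs to the row axes[j]

cov = covariance_from_axes(axes, axis_lengths)
```

Under these assumptions the trace of cov equals the sum of the squared axis lengths, which gives a quick sanity check on the pairing between axes and lengths.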
- validate_k(n_clusters)#
Make sure the number of clusters is valid.
repliclust.maxmin.groupsizes#
This module provides functionality for sampling the number of data points in each cluster using a max-min approach.
- class repliclust.maxmin.groupsizes.MaxMinGroupSizeSampler(imbalance_ratio=2)#
Bases:
GroupSizeSampler
Sample the number of data points in each cluster using pairwise max-min sampling.
- imbalance_ratio#
The desired ratio between largest and smallest group size.
- Type:
float, >= 1
- __init__(self, imbalance_ratio)#
- make_group_sizes(self, clusterdata)#
- sample_group_sizes(archetype, total)#
Sample the number of data points for each cluster using pairwise max-min sampling.
- Parameters:
archetype (Archetype) – Archetype for a mixture model.
total (int) – The total number of data points (sum of group sizes).
- Returns:
group_sizes – The number of data points for each cluster.
- Return type:
ndarray