Max-Min Implementation#
repliclust.maxmin#
This module provides functionality for implementing data set archetypes based on max-min sampling.
repliclust.maxmin.archetype#
This module implements an archetype for mixture models. The user chooses the desired geometry by setting the ratios between the largest and smallest values of various geometric parameters.
- class repliclust.maxmin.archetype.MaxMinArchetype(n_clusters=6, dim=2, n_samples=500, max_overlap=0.05, min_overlap=0.001, imbalance_ratio=2, aspect_maxmin=2, radius_maxmin=3, aspect_ref=1.5, name=None, scale=1.0, packing=0.1, distributions=['normal', 'exponential'], distribution_proportions=None, overlap_mode='auto', linear_penalty_weight=0.01, learning_rate='auto')#
Bases:
Archetype
A data set archetype that defines the overall geometry of a data set using max-min ratios.
The user sets the ratios between largest and smallest values of various geometric parameters.
- Parameters:
n_clusters (int) – The desired number of clusters.
dim (int) – The desired number of dimensions.
radius_maxmin (float, >=1) – Ratio between the maximum and minimum radii among all clusters in a mixture model.
aspect_maxmin (float, >=1) – Ratio between the maximum and minimum aspect ratios among all clusters in a mixture model.
aspect_ref (float, >=1) – Typical aspect ratio for the clusters in a mixture model. For example, if aspect_ref = 10, we expect that all clusters in the mixture model are strongly elongated.
imbalance_ratio (float, >=1) – Ratio between the largest and smallest group sizes among all clusters in the mixture model.
min_overlap (float in (0,1)) – The minimum required overlap between a cluster and some other cluster. This minimum overlap allows you to guarantee that no cluster will be isolated from all other clusters.
max_overlap (float in (0,1)) – The maximum allowed level of overlap between any two clusters. Measured as the fraction of cluster volume that overlaps.
scale (float) – Reference length scale for the generated data.
distributions (list of [str | tuple[str, dict]]) – Selection of probability distributions that should appear in each mixture model. Format is a list in which each element is either the name of the probability distribution OR a tuple whose first entry is the name and the second entry is a dictionary of distributional parameters. To print the names of all supported distributions and their parameters (along with default values), print the output of repliclust.get_supported_distributions().
distribution_proportions – The proportions of clusters that have each distribution listed in distributions.
overlap_mode ({"auto", "lda", "c2c"}) – Select the degree of precision when computing cluster overlaps.
Notes
Below is a short glossary of some geometric terms used above.
- Group size (int)
The number of data points in a cluster.
- Cluster radius (float)
Geometric mean of the standard deviations along a cluster’s principal axes (eigenvectors of covariance matrix).
- Cluster aspect ratio (float)
Ratio between the lengths of a cluster’s longest and shortest principal axes (eigenvectors of covariance matrix). This value equals 1 for a spherical cluster and exceeds 1 for an oblong cluster.
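The two glossary definitions above can be illustrated with a small calculation. This is a minimal sketch, not part of repliclust: the helper names cluster_radius and cluster_aspect_ratio are hypothetical, and it assumes the axis lengths are the standard deviations along the principal axes.

```python
import math

def cluster_radius(axis_stds):
    """Geometric mean of the standard deviations along the principal axes."""
    return math.exp(sum(math.log(s) for s in axis_stds) / len(axis_stds))

def cluster_aspect_ratio(axis_stds):
    """Ratio between the longest and shortest principal axis lengths."""
    return max(axis_stds) / min(axis_stds)

# Hypothetical 2D cluster with standard deviations 4 and 1 along its axes.
stds = [4.0, 1.0]
radius = cluster_radius(stds)        # geometric mean: sqrt(4 * 1) = 2
aspect = cluster_aspect_ratio(stds)  # 4 / 1 = 4, an oblong cluster
```

A spherical cluster, such as one with standard deviations [2.0, 2.0], has the same radius but an aspect ratio of 1.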
- guess_learning_rate(dim)#
Guess the appropriate learning rate as a function of dimension.
- repliclust.maxmin.archetype.parse_distribution_selection(distributions: list, proportions=None)#
Parse user selection of probability distributions and reformat it as an input for constructing a FixedProportionMix object.
- Parameters:
distributions (list of [ str | tuple[str, dict] ]) – Selection of probability distributions that should appear in each mixture model. Format is a list in which each element is either the name of the probability distribution OR a tuple whose first entry is the name and the second entry is a dictionary of distributional parameters. To print all valid distribution names, call the function repliclust.print_supported_distributions().
- Returns:
distributions_parsed – Input for constructing a FixedProportionMix object.
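The accepted input format (a list mixing plain names and (name, parameters) tuples) can be normalized along the following lines. This is only a sketch: the helper normalize_selection, the (name, proportion, params) output layout, the equal-proportions default, and the "scale" parameter name are all assumptions made for illustration. The actual input expected by FixedProportionMix may differ.

```python
def normalize_selection(distributions, proportions=None):
    """Return a list of (name, proportion, params) triples.

    Hypothetical helper; not the format used by FixedProportionMix.
    """
    if not distributions:
        raise ValueError("select at least one distribution")
    if proportions is None:
        # Assumption: equal proportions when none are given.
        proportions = [1.0] * len(distributions)
    if len(proportions) != len(distributions):
        raise ValueError("need exactly one proportion per distribution")
    if any(p < 0 for p in proportions) or sum(proportions) <= 0:
        raise ValueError("proportions must be non-negative, with a positive sum")

    total = sum(proportions)
    parsed = []
    for entry, weight in zip(distributions, proportions):
        if isinstance(entry, str):
            name, params = entry, {}   # bare name: no custom parameters
        else:
            name, params = entry       # (name, dict of parameters)
        parsed.append((name, weight / total, dict(params)))
    return parsed

# The parameter name "scale" is illustrative only.
selection = ["normal", ("exponential", {"scale": 2.0})]
parsed = normalize_selection(selection, proportions=[3, 1])
```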
- repliclust.maxmin.archetype.validate_archetype_args(**args)#
Validate all provided arguments for a MaxMinArchetype.
- repliclust.maxmin.archetype.validate_maxmin_ratios(maxmin_ratio=2, arg_name='aspect_maxmin', underlying_param='aspect ratio')#
Check that a max-min ratio is >= 1.
- repliclust.maxmin.archetype.validate_overlaps(max_overlap=0.05, min_overlap=0.001)#
Check that the overlap bounds are valid. The boundary values max_overlap=1 and min_overlap=0 are allowed; they should have the effect of removing one or both overlap constraints.
- repliclust.maxmin.archetype.validate_reference_quantity(ref_qty=1.5, min_allowed_value=1, name='aspect_ref')#
Check that a reference value exceeds its minimum allowed value.
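The three checks described above might look roughly as follows. This is a hedged sketch rather than the library's code: the function names are hypothetical, the exception type is an assumption, and so are the requirement that min_overlap be strictly below max_overlap and the reading of "exceeds" as "is at least" (consistent with aspect_ref being documented as >= 1).

```python
def check_maxmin_ratio(maxmin_ratio=2, arg_name="aspect_maxmin"):
    """A max-min ratio compares a largest to a smallest value, so it is >= 1."""
    if maxmin_ratio < 1:
        raise ValueError(f"{arg_name} must be >= 1, got {maxmin_ratio}")

def check_overlaps(max_overlap=0.05, min_overlap=0.001):
    """Overlaps are fractions; the boundary values 0 and 1 are allowed."""
    # Assumption: the minimum must lie strictly below the maximum.
    if not (0 <= min_overlap < max_overlap <= 1):
        raise ValueError(
            "need 0 <= min_overlap < max_overlap <= 1, got "
            f"min_overlap={min_overlap}, max_overlap={max_overlap}"
        )

def check_reference_quantity(ref_qty=1.5, min_allowed_value=1, name="aspect_ref"):
    """A reference value must not fall below its minimum allowed value."""
    if ref_qty < min_allowed_value:
        raise ValueError(f"{name} must be >= {min_allowed_value}, got {ref_qty}")

# Valid settings pass silently, including the boundary overlaps.
check_maxmin_ratio(2)
check_overlaps(max_overlap=1, min_overlap=0)
check_reference_quantity(1.5)
```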
repliclust.maxmin.covariance#
This module provides a class for sampling cluster covariances using the max-min approach.
- class repliclust.maxmin.covariance.MaxMinCovarianceSampler(aspect_ref=1.5, aspect_maxmin=2, radius_maxmin=2)#
Bases:
object
Sample covariances for the clusters in a mixture model by specifying the max-min ratios for various geometric parameters.
See documentation of class MaxMinArchetype for more information.
- aspect_ref#
Reference aspect ratio for clusters in the mixture model.
- Type:
float, >= 1
- aspect_maxmin#
Max-min ratio for the aspect ratios of clusters in the mixture model.
- Type:
float, >= 1
- radius_maxmin#
Max-min ratio for the radii of clusters in the mixture model.
- Type:
float, >= 1
- make_axis_lengths(n_axes, reference_length, aspect_ratio)#
Sample the lengths of all principal axes for a single cluster.
- Parameters:
n_axes (int) – The number of principal axes (same as the dimensionality).
reference_length (float) – Desired geometric mean of the lengths.
aspect_ratio (float) – Desired ratio between longest and shortest lengths.
- Returns:
lengths – Lengths of the principal axes for this cluster.
- Return type:
ndarray
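One way to satisfy both constraints (a prescribed geometric mean and a prescribed ratio between the longest and shortest lengths) is to work in log space. The sketch below is an illustration under assumptions, not the method used by make_axis_lengths: the function name sketch_axis_lengths, the uniform sampling of the intermediate axes, and the rng argument are all hypothetical. It returns a plain list rather than an ndarray.

```python
import math
import random

def sketch_axis_lengths(n_axes, reference_length, aspect_ratio, rng=None):
    """Sample axis lengths with a given geometric mean and max/min ratio."""
    if rng is None:
        rng = random.Random()
    if n_axes == 1:
        return [float(reference_length)]
    half = 0.5 * math.log(aspect_ratio)
    # Pin the shortest and longest axes so that their ratio is exactly
    # aspect_ratio; draw the remaining axes uniformly in log space between them.
    logs = [-half, half] + [rng.uniform(-half, half) for _ in range(n_axes - 2)]
    # Shifting every log by the same constant rescales all lengths equally,
    # which keeps the ratio intact and sets the geometric mean.
    shift = math.log(reference_length) - sum(logs) / n_axes
    return [math.exp(x + shift) for x in logs]

lengths = sketch_axis_lengths(
    n_axes=4, reference_length=2.0, aspect_ratio=3.0, rng=random.Random(0)
)
```

Because the rescaling is a common multiplicative factor, the two constraints do not interfere with each other.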
- make_cluster_aspect_ratios(n_clusters)#
Sample aspect ratios for all clusters.
The aspect ratio of a cluster measures how oblong/ellipsoidal the cluster is. It is defined as the ratio between the lengths of the cluster’s longest and shortest principal axes.
- Parameters:
n_clusters (int) – The number of clusters.
- Returns:
out – The aspect ratios for each cluster.
- Return type:
ndarray
- make_cluster_radii(n_clusters, ref_radius, dim)#
Sample cluster radii using pairwise max-min sampling.
Sampling constrains the arithmetic mean of cluster volumes to equal the reference volume (namely ref_radius**dim). The minimum and maximum cluster radii of the resulting sample average to the reference radius.
- Parameters:
n_clusters (int) – The number of clusters.
ref_radius (float) – The reference radius for the clusters.
dim (int) – The number of dimensions.
- Returns:
radii – Radii for all the clusters.
- Return type:
ndarray
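The volume constraint above suggests sampling radii in pairs whose volumes average to the reference volume. The following is a sketch of that idea under stated assumptions, not the library's implementation: the function name, the explicit radius_maxmin argument, the uniform draw, and the handling of an odd number of clusters are hypothetical, and "volume" here means radius**dim (constant factors are ignored).

```python
import random

def sketch_cluster_radii(n_clusters, ref_radius, dim, radius_maxmin, rng=None):
    """Sample radii whose volumes (radius**dim) average to ref_radius**dim."""
    if rng is None:
        rng = random.Random()
    if n_clusters == 1:
        return [float(ref_radius)]
    ref_vol = ref_radius ** dim
    # Choose the extreme volumes so that they average to ref_vol and the
    # corresponding radii have ratio radius_maxmin.
    min_vol = 2 * ref_vol / (1 + radius_maxmin ** dim)
    max_vol = 2 * ref_vol - min_vol
    volumes = [min_vol, max_vol]
    # Add further pairs (v, 2*ref_vol - v); each pair averages to ref_vol.
    while len(volumes) + 2 <= n_clusters:
        v = rng.uniform(min_vol, ref_vol)
        volumes += [v, 2 * ref_vol - v]
    # With an odd number of clusters, the last one gets the reference volume.
    if len(volumes) < n_clusters:
        volumes.append(ref_vol)
    return [v ** (1 / dim) for v in volumes]

radii = sketch_cluster_radii(
    n_clusters=5, ref_radius=1.0, dim=2, radius_maxmin=3.0, rng=random.Random(0)
)
```

Every pair is reflected around the reference volume, so the mean volume stays fixed no matter which values are drawn.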
- sample_covariances(archetype)#
Compute the principal axes and their lengths for each cluster in a mixture model.
- Parameters:
archetype (Archetype) – Archetype for a mixture model.
- Returns:
(axes_list, axis_lengths_list) – Tuple with two components. The first component, axes_list, is a list whose i-th element stores the principal axes of the i-th cluster as a matrix (each row is an axis). The second component, axis_lengths_list, is a list whose i-th element stores the lengths of the i-th cluster’s principal axes as a vector. In particular, for any cluster i and axis j, the number axis_lengths_list[i][j] is the length corresponding to the principal axis axes_list[i][j,:].
- Return type:
Tuple[List[ndarray], List[ndarray]]
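To show how the two returned components fit together, the sketch below rebuilds a covariance matrix from one cluster's principal axes and axis lengths. It is an illustration only: the helper covariance_from_axes is hypothetical, and it assumes that the rows of the axes matrix are orthonormal and that the lengths are standard deviations along those axes. Plain nested lists stand in for ndarrays.

```python
import math

def covariance_from_axes(axes, axis_lengths):
    """Sum of length**2 * outer(axis, axis) over all principal axes."""
    dim = len(axes[0])
    cov = [[0.0] * dim for _ in range(dim)]
    for axis, length in zip(axes, axis_lengths):
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += (length ** 2) * axis[r] * axis[c]
    return cov

# Hypothetical 2D cluster whose long axis points along the diagonal.
s = math.sqrt(0.5)
axes = [[s, s], [-s, s]]   # each row is one principal axis
axis_lengths = [2.0, 1.0]  # axis_lengths[j] belongs to axes[j]
cov = covariance_from_axes(axes, axis_lengths)
# cov is approximately [[2.5, 1.5], [1.5, 2.5]]
```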
- validate_k(n_clusters)#
Make sure the number of clusters is valid.
repliclust.maxmin.groupsizes#
This module provides functionality for sampling the number of data points in each cluster using a max-min approach.
- class repliclust.maxmin.groupsizes.MaxMinGroupSizeSampler(imbalance_ratio=2)#
Bases:
GroupSizeSampler
Sample the number of data points in each cluster using pairwise max-min sampling.
- imbalance_ratio#
The desired ratio between largest and smallest group size.
- Type:
float, >=1
- __init__(self, imbalance_ratio)#
- make_group_sizes(self, clusterdata)#
- sample_group_sizes(archetype, total)#
Sample the number of data points for each cluster using pairwise max-min sampling.
- Parameters:
archetype (Archetype) – Archetype for a mixture model.
total (int) – The total number of data points (sum of group sizes).
- Returns:
group_sizes – The number of data points for each cluster.
- Return type:
ndarray
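Pairwise max-min sampling of group sizes can be sketched in the same spirit: draw relative weights in pairs that average to 1, then convert them to integers that sum to the requested total. This is an assumption-laden illustration, not the library's algorithm: the function name, the explicit imbalance_ratio argument, the uniform draw, and the largest-remainder rounding are hypothetical. After rounding, the ratio between the largest and smallest group is only approximately imbalance_ratio.

```python
import random

def sketch_group_sizes(n_clusters, total, imbalance_ratio, rng=None):
    """Sample integer group sizes that sum to total."""
    if rng is None:
        rng = random.Random()
    if n_clusters == 1:
        return [total]
    # Relative weights with mean 1; the extremes have ratio imbalance_ratio.
    w_min = 2 / (1 + imbalance_ratio)
    w_max = 2 - w_min
    weights = [w_min, w_max]
    while len(weights) + 2 <= n_clusters:
        w = rng.uniform(w_min, 1.0)
        weights += [w, 2 - w]  # each pair averages to 1
    if len(weights) < n_clusters:
        weights.append(1.0)
    # Largest-remainder rounding: floor every size, then hand the leftover
    # points to the clusters with the largest fractional parts.
    exact = [w * total / n_clusters for w in weights]
    sizes = [int(x) for x in exact]
    leftover = total - sum(sizes)
    order = sorted(range(n_clusters), key=lambda i: exact[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes

group_sizes = sketch_group_sizes(
    n_clusters=6, total=500, imbalance_ratio=2, rng=random.Random(0)
)
```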