Max-Min Implementation

repliclust.maxmin

This module provides functionality for implementing data set archetypes based on max-min sampling.

repliclust.maxmin.archetype

This module implements an archetype for mixture models. The user chooses the desired geometry by setting the ratios between the largest and smallest values of various geometric parameters.

class repliclust.maxmin.archetype.MaxMinArchetype(n_clusters=6, dim=2, n_samples=500, max_overlap=0.05, min_overlap=0.001, imbalance_ratio=2, aspect_maxmin=2, radius_maxmin=3, aspect_ref=1.5, name=None, scale=1.0, packing=0.1, distributions=['normal', 'exponential'], distribution_proportions=None, overlap_mode='auto', linear_penalty_weight=0.01, learning_rate='auto')

Bases: Archetype

A data set archetype that defines the overall geometry of a data set using max-min ratios.

The user sets the ratios between largest and smallest values of various geometric parameters.

Parameters:
  • n_clusters (int) – The desired number of clusters.

  • dim (int) – The desired number of dimensions.

  • radius_maxmin (float, >=1) – Ratio between the maximum and minimum radii among all clusters in a mixture model.

  • aspect_maxmin (float, >=1) – Ratio between the maximum and minimum aspect ratios among all clusters in a mixture model.

  • aspect_ref (float, >=1) – Typical aspect ratio for the clusters in a mixture model. For example, if aspect_ref = 10, we expect that all clusters in the mixture model are strongly elongated.

  • imbalance_ratio (float, >=1) – Ratio between the largest and smallest group sizes among all clusters in the mixture model.

  • min_overlap (float in (0,1)) – The minimum required overlap between a cluster and some other cluster. This minimum overlap allows you to guarantee that no cluster will be isolated from all other clusters.

  • max_overlap (float in (0,1)) – The maximum allowed level of overlap between any two clusters. Measured as the fraction of cluster volume that overlaps.

  • scale (float) – Reference length scale for the generated data.

  • distributions (list of [str | tuple[str, dict]]) – Selection of probability distributions that should appear in each mixture model. Format is a list in which each element is either the name of the probability distribution OR a tuple whose first entry is the name and the second entry is a dictionary of distributional parameters. To print the names of all supported distributions and their parameters (along with default values), print the output of repliclust.get_supported_distributions().

  • distribution_proportions – The proportions of clusters that have each distribution listed in distributions.

  • overlap_mode ({"auto", "lda", "c2c"}) – Select the degree of precision when computing cluster overlaps.

Notes

Below is a short glossary of some geometric terms used above.

Group size (int)

The number of data points in a cluster.

Cluster radius (float)

Geometric mean of the standard deviations along a cluster’s principal axes (eigenvectors of covariance matrix).

Cluster aspect ratio (float)

Ratio between the lengths of a cluster’s longest and shortest principal axes (eigenvectors of covariance matrix). This value equals 1 for a spherical cluster and exceeds 1 for an oblong cluster.
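The two glossary definitions above can be checked by hand. The following is a minimal sketch, not part of repliclust: the helper names cluster_radius and cluster_aspect_ratio are hypothetical, and the input is assumed to be the list of standard deviations along a cluster's principal axes.

```python
import math


def cluster_radius(std_devs):
    """Geometric mean of the standard deviations along the principal axes."""
    log_mean = sum(math.log(s) for s in std_devs) / len(std_devs)
    return math.exp(log_mean)


def cluster_aspect_ratio(std_devs):
    """Ratio between the longest and shortest principal axis lengths."""
    return max(std_devs) / min(std_devs)


# Hypothetical 2D cluster: standard deviation 4 along one axis, 1 along the other.
std_devs = [4.0, 1.0]
radius = cluster_radius(std_devs)        # geometric mean, sqrt(4 * 1)
aspect = cluster_aspect_ratio(std_devs)  # 4 / 1

# A spherical cluster has equal axis lengths, hence aspect ratio 1.
spherical_aspect = cluster_aspect_ratio([2.0, 2.0, 2.0])
```

The geometric mean is computed in log space, which avoids overflow when the number of dimensions is large.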

guess_learning_rate(dim)

Guess the appropriate learning rate as a function of dimension.

repliclust.maxmin.archetype.parse_distribution_selection(distributions: list, proportions=None)

Parse user selection of probability distributions and reformat it as an input for constructing a FixedProportionMix object.

Parameters:

distributions (list of [ str | tuple[str, dict] ]) – Selection of probability distributions that should appear in each mixture model. Format is a list in which each element is either the name of the probability distribution OR a tuple whose first entry is the name and the second entry is a dictionary of distributional parameters. To print all valid distribution names, call the function repliclust.print_supported_distributions().

Returns:

distributions_parsed – Input for constructing a FixedProportionMix object.
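To illustrate the accepted input format (each entry is either a name or a (name, parameters) tuple), here is a minimal sketch of how such a selection could be normalized. It is not the library's parser: normalize_selection is a hypothetical helper, its output layout is not the actual FixedProportionMix input, equal shares by default are an assumption, and the ('standard_t', {'df': 5}) entry is an assumed example that should be checked against the list of supported distributions.

```python
def normalize_selection(distributions, proportions=None):
    """Hypothetical helper: return a list of (name, params, proportion)."""
    if proportions is None:
        # Assumption: equal shares when no proportions are given.
        proportions = [1.0] * len(distributions)
    if len(proportions) != len(distributions):
        raise ValueError("need exactly one proportion per distribution")

    total = sum(proportions)
    parsed = []
    for entry, share in zip(distributions, proportions):
        if isinstance(entry, str):
            # Bare name: no distributional parameters supplied.
            name, params = entry, {}
        else:
            # Tuple of (name, dictionary of distributional parameters).
            name, params = entry
        parsed.append((name, dict(params), share / total))
    return parsed


# Mixed selection: one bare name and one (name, params) tuple.
selection = ["normal", ("standard_t", {"df": 5})]
parsed = normalize_selection(selection, proportions=[3, 1])
```

Normalizing the proportions so that they sum to 1 lets the caller pass relative weights such as [3, 1] instead of exact fractions.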

repliclust.maxmin.archetype.validate_archetype_args(**args)

Validate all provided arguments for a MaxMinArchetype.

repliclust.maxmin.archetype.validate_maxmin_ratios(maxmin_ratio=2, arg_name='aspect_maxmin', underlying_param='aspect ratio')

Check that a max-min ratio is >= 1.

repliclust.maxmin.archetype.validate_overlaps(max_overlap=0.05, min_overlap=0.001)

Validate the overlap bounds. The boundary values max_overlap=1 and min_overlap=0 are allowed; each should have the effect of removing the corresponding overlap constraint.
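A check of this kind might look as follows. This is a minimal sketch, not the library's code: check_overlaps is a hypothetical stand-in, the error messages are invented, and the requirement that min_overlap lie strictly below max_overlap is an assumption.

```python
def check_overlaps(max_overlap=0.05, min_overlap=0.001):
    """Hypothetical stand-in for an overlap-bounds check."""
    # The boundary values 0 and 1 are accepted; they switch a constraint off.
    if not 0 <= min_overlap <= 1:
        raise ValueError("min_overlap must lie in [0, 1]")
    if not 0 <= max_overlap <= 1:
        raise ValueError("max_overlap must lie in [0, 1]")
    # Assumption: the lower bound must stay strictly below the upper bound.
    if min_overlap >= max_overlap:
        raise ValueError("min_overlap must be smaller than max_overlap")


check_overlaps()                               # default bounds pass
check_overlaps(max_overlap=1, min_overlap=0)   # both constraints removed

try:
    check_overlaps(max_overlap=0.01, min_overlap=0.05)
    rejected = False
except ValueError:
    rejected = True  # inverted bounds are refused
```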

repliclust.maxmin.archetype.validate_reference_quantity(ref_qty=1.5, min_allowed_value=1, name='aspect_ref')

Check that a reference value exceeds its minimum allowed value.

repliclust.maxmin.covariance

This module provides a class for sampling cluster covariances using the max-min approach.

class repliclust.maxmin.covariance.MaxMinCovarianceSampler(aspect_ref=1.5, aspect_maxmin=2, radius_maxmin=2)

Bases: object

Sample covariances for the clusters in a mixture model by specifying the max-min ratios for various geometric parameters.

See documentation of class MaxMinArchetype for more information.

aspect_ref

Reference aspect ratio for clusters in the mixture model.

Type:

float, >= 1

aspect_maxmin

Max-min ratio for the aspect ratios of clusters in the mixture model.

Type:

float, >= 1

radius_maxmin

Max-min ratio for the radii of clusters in the mixture model.

Type:

float, >= 1

make_axis_lengths(n_axes, reference_length, aspect_ratio)

Sample the lengths of all principal axes for a single cluster.

Parameters:
  • n_axes (int) – The number of principal axes (same as the dimensionality).

  • reference_length (float) – Desired geometric mean of the lengths.

  • aspect_ratio (float) – Desired ratio between longest and shortest lengths.

Returns:

lengths – Lengths of the principal axes for this cluster.

Return type:

ndarray
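One way to satisfy both constraints (a fixed geometric mean and a fixed ratio between the longest and shortest lengths) is to work on a log scale. The sketch below is an illustration under that assumption, not the library's sampling scheme; sketch_axis_lengths and its rng argument are hypothetical, and it returns a plain list rather than an ndarray.

```python
import math
import random


def sketch_axis_lengths(n_axes, reference_length, aspect_ratio, rng=None):
    """Hypothetical sampler: geometric mean and max/min ratio are both met."""
    rng = rng or random.Random()
    if n_axes == 1:
        # A single axis has no aspect ratio to speak of.
        return [float(reference_length)]

    half = 0.5 * math.log(aspect_ratio)
    # Pin the shortest and longest axes, then draw the remaining log-lengths
    # uniformly in between.
    logs = [-half, half] + [rng.uniform(-half, half) for _ in range(n_axes - 2)]
    # Shift all logs so their mean equals log(reference_length). A common
    # shift multiplies every length by the same factor, so max/min is kept.
    shift = math.log(reference_length) - sum(logs) / n_axes
    return [math.exp(v + shift) for v in logs]


lengths = sketch_axis_lengths(5, reference_length=2.0, aspect_ratio=3.0,
                              rng=random.Random(0))
geo_mean = math.exp(sum(math.log(x) for x in lengths) / len(lengths))
ratio = max(lengths) / min(lengths)
```

Up to floating-point error, geo_mean equals the requested reference_length and ratio equals the requested aspect_ratio.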

make_cluster_aspect_ratios(n_clusters)

Sample aspect ratios for all clusters.

The aspect ratio of a cluster measures how oblong/ellipsoidal the cluster is. It is defined as the ratio between the lengths of the cluster’s longest and shortest principal axes.

Parameters:

n_clusters (int) – The number of clusters.

Returns:

out – The aspect ratios for each cluster.

Return type:

ndarray

make_cluster_radii(n_clusters, ref_radius, dim)

Sample cluster radii using pairwise max-min sampling.

Sampling constrains the arithmetic mean of the cluster volumes to equal the reference volume (namely ref_radius**dim). The minimum and maximum cluster radii of the resulting sample average to the reference radius.

Parameters:
  • n_clusters (int) – The number of clusters.

  • ref_radius (float) – The reference radius for the clusters.

  • dim (int) – The number of dimensions.

Returns:

radii – Radii for all the clusters.

Return type:

ndarray
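The idea of sampling in pairs so that a mean is preserved can be sketched as follows. This is a simplified variant written for illustration, not the library's algorithm: sketch_cluster_radii is hypothetical, it takes radius_maxmin as an explicit argument, and it pins the smallest and largest volumes (rather than radii) symmetrically around the reference volume, so it does not reproduce every property stated above.

```python
import random


def sketch_cluster_radii(n_clusters, ref_radius, dim, radius_maxmin, rng=None):
    """Hypothetical pairwise sampler working on cluster volumes."""
    rng = rng or random.Random()
    v_ref = ref_radius ** dim
    if n_clusters == 1:
        return [float(ref_radius)]

    # Pin the extreme volumes so that they average to v_ref and their
    # radius ratio equals radius_maxmin.
    vol_ratio = radius_maxmin ** dim
    v_min = 2 * v_ref / (1 + vol_ratio)
    v_max = vol_ratio * v_min
    volumes = [v_min, v_max]

    # Add mirrored pairs (v, 2*v_ref - v); each pair averages to v_ref and
    # stays inside [v_min, v_max].
    while len(volumes) + 2 <= n_clusters:
        v = rng.uniform(v_min, v_ref)
        volumes += [v, 2 * v_ref - v]

    # With an odd number of clusters, the last one takes the reference volume.
    if len(volumes) < n_clusters:
        volumes.append(v_ref)

    return [v ** (1.0 / dim) for v in volumes]


radii = sketch_cluster_radii(7, ref_radius=1.5, dim=3, radius_maxmin=2.0,
                             rng=random.Random(1))
mean_volume = sum(r ** 3 for r in radii) / len(radii)
radius_ratio = max(radii) / min(radii)
```

Because every pair averages to the reference volume, the arithmetic mean of all volumes equals ref_radius**dim regardless of the random draws.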

sample_covariances(archetype)

Compute the principal axes and their lengths for each cluster in a mixture model.

Parameters:

archetype (Archetype) – Archetype for a mixture model.

Returns:

(axes_list, axis_lengths_list) – Tuple with two components. The first component, axes_list, is a list whose i-th element stores the principal axes of the i-th cluster as a matrix (each row is an axis). The second component, axis_lengths_list, is a list whose i-th element stores the lengths of the i-th cluster’s principal axes as a vector. In particular, for any cluster i and axis j, the number axis_lengths_list[i][j] is the length corresponding to the principal axis axes_list[i][j,:].

Return type:

Tuple[List[ndarray], List[ndarray]]
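The relationship between the two returned lists can be made concrete by rebuilding a covariance matrix from one cluster's axes and axis lengths. The sketch below is not part of repliclust and rests on two assumptions: the rows of the axes matrix are orthonormal, and the axis lengths are standard deviations along those axes. Plain nested lists stand in for ndarrays to keep the example dependency-free.

```python
def covariance_from_axes(axes, lengths):
    """Hypothetical helper: sum over j of lengths[j]**2 * outer(axes[j], axes[j])."""
    dim = len(lengths)
    cov = [[0.0] * dim for _ in range(dim)]
    for axis, length in zip(axes, lengths):
        variance = length ** 2  # assumption: lengths are standard deviations
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += variance * axis[r] * axis[c]
    return cov


# Hypothetical cluster i: each row of `axes` is one principal axis, and
# lengths[j] is the length that belongs to the axis axes[j].
axes = [[1.0, 0.0],
        [0.0, 1.0]]
lengths = [3.0, 0.5]
cov = covariance_from_axes(axes, lengths)
```

For these axis-aligned axes the result is a diagonal matrix whose entries are the squared lengths.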

validate_k(n_clusters)

Make sure the number of clusters is valid.

repliclust.maxmin.groupsizes

This module provides functionality for sampling the number of data points in each cluster using a max-min approach.

class repliclust.maxmin.groupsizes.MaxMinGroupSizeSampler(imbalance_ratio=2)

Bases: GroupSizeSampler

Sample the number of data points in each cluster using pairwise max-min sampling.

imbalance_ratio

The desired ratio between the largest and smallest group sizes.

Type:

float, >=1

__init__(self, imbalance_ratio)
make_group_sizes(self, clusterdata)

sample_group_sizes(archetype, total)

Sample the number of data points for each cluster using pairwise max-min sampling.

Parameters:
  • archetype (Archetype) – Archetype for a mixture model.

  • total (int) – The total number of data points (sum of group sizes).

Returns:

group_sizes – The number of data points for each cluster.

Return type:

ndarray
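Group sizes must be integers that sum to total, so relative weights have to be rounded at some point. The sketch below shows one common way to do this (largest-remainder rounding); it is an illustration, not the library's method, and sketch_group_sizes as well as the example weights are hypothetical.

```python
def sketch_group_sizes(weights, total):
    """Hypothetical helper: integer sizes proportional to weights, summing to total."""
    scale = total / sum(weights)
    exact = [w * scale for w in weights]
    # Round every share down first (int() floors non-negative values).
    sizes = [int(x) for x in exact]
    leftover = total - sum(sizes)
    # Hand the remaining points to the clusters with the largest remainders.
    order = sorted(range(len(weights)),
                   key=lambda i: exact[i] - sizes[i],
                   reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


# Relative weights spanning a max-min ratio of 2 (largest / smallest).
weights = [1.0, 1.5, 2.0]
sizes = sketch_group_sizes(weights, total=100)
imbalance = max(sizes) / min(sizes)
```

Rounding means the realized ratio between the largest and smallest group sizes can differ slightly from the requested one, especially when total is small.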