Specifying an Archetype
In this section, we explain how to customize a data set archetype to obtain synthetic data that suits your needs.
Basic Parameters
The basic parameters of each Archetype are the desired number of clusters (n_clusters), the number of dimensions of the data (dim), the desired total number of data points in each synthetic data set (n_samples), and the name of the archetype (name).
Overlaps Between Clusters
We quantify the overlap between any pair of clusters as a number between 0 and 1 that can be read as a percentage. Roughly, an overlap of 0.05 indicates that the outer 5% of the clusters’ probability densities overlap.
In a data set with k clusters, there are k(k-1)/2 pairs of clusters.
To quantify the desired overlap for the whole data set, you can use the parameters min_overlap and max_overlap. The latter parameter imposes an upper limit on the overlap between any pair of clusters. Hence, decrease max_overlap if you want to ensure that clusters are farther apart. On the other hand, min_overlap sets a lower limit on the overlap between a cluster and its closest neighbor. In other words, increase min_overlap if you want to avoid isolated clusters. Choose similar values for min_overlap and max_overlap if you would like to impose a consistent overlap across all synthetic data sets. However, keep in mind that max_overlap must always exceed min_overlap; in addition, when the gap between min_overlap and max_overlap is too small, data generation may take unacceptably long.
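The two bounds can be read as a check on the table of pairwise overlaps. The sketch below is a plain-Python illustration, not repliclust's implementation: the helper names count_cluster_pairs and satisfies_overlap_bounds are hypothetical, the overlap values are made up, and it assumes that a cluster's closest neighbor is the cluster it overlaps with most.

```python
from itertools import combinations


def count_cluster_pairs(k):
    """Number of unordered cluster pairs, k(k-1)/2."""
    return k * (k - 1) // 2


def satisfies_overlap_bounds(pair_overlaps, k, min_overlap, max_overlap):
    """Check hypothetical pairwise overlaps against both bounds.

    pair_overlaps maps each pair (i, j) with i < j to an overlap in [0, 1].
    Assumption: the closest neighbor of a cluster is the one with which
    it has the largest overlap.
    """
    # Upper bound: no pair may overlap more than max_overlap.
    if any(v > max_overlap for v in pair_overlaps.values()):
        return False
    # Lower bound: each cluster must overlap at least min_overlap
    # with its closest neighbor.
    for c in range(k):
        closest = max(v for pair, v in pair_overlaps.items() if c in pair)
        if closest < min_overlap:
            return False
    return True


k = 4
pairs = list(combinations(range(k), 2))
print(len(pairs), count_cluster_pairs(k))  # both counts agree: 6 6

# Made-up overlaps: clusters 0, 1 and 2 touch; cluster 3 is isolated.
overlaps = {pair: 0.0001 for pair in pairs}
overlaps[(0, 1)] = 0.04
overlaps[(1, 2)] = 0.045

# Cluster 3 violates a lower bound of 0.03 ...
print(satisfies_overlap_bounds(overlaps, k, 0.03, 0.05))     # False
# ... but passes once the lower bound is loosened.
print(satisfies_overlap_bounds(overlaps, k, 0.00005, 0.05))  # True
```

Raising min_overlap in this toy check rejects the configuration with the isolated cluster, which mirrors the advice above on avoiding isolated clusters.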
The simulation below generates synthetic data sets for various choices of min_overlap and max_overlap. We discuss the results below.
from repliclust import set_seed, Archetype, DataGenerator
import matplotlib.pyplot as plt

set_seed(2)

# Three settings: tight around 0.1%, wide range, tight around 50%.
eps = 0.025
overlap_settings = [
    {'min_overlap': 1e-3, 'max_overlap': (1+eps)*1e-3},
    {'min_overlap': 1e-3, 'max_overlap': 0.5},
    {'min_overlap': 0.5, 'max_overlap': (1+eps)*0.5}
]

# One row of four data sets per overlap setting.
for i, overlaps in enumerate(overlap_settings):
    fig, ax = plt.subplots(
        figsize=(10,2), dpi=300, nrows=1, ncols=4)
    description = (
        r"$\bf{Cluster~Overlaps~around~0.1\%}$" if i==0
        else (r"$\bf{Cluster~Overlaps~"
              + r"between~0.1\%~and~50\%}$" if (i==1)
              else r"$\bf{Cluster~Overlaps~around~50\%}$")
    )
    fig.suptitle(description + '\n'
                 + "min_overlap" + r"$ \approx $"
                 + str(overlaps['min_overlap'])
                 + ", max_overlap" + r"$ \approx $"
                 + str(round(overlaps['max_overlap'],3)),
                 y=1.15)
    for j in range(4):
        archetype = Archetype(
            min_overlap=overlaps['min_overlap'],
            max_overlap=overlaps['max_overlap']
        )
        # Draw one synthetic data set from the archetype.
        X, y, archetype = (DataGenerator(archetype)
                           .synthesize(quiet=True))
        ax[j].scatter(X[:,0], X[:,1], c=y, s=5,
                      alpha=0.5, linewidth=0.5)
        ax[j].set_xticks([]); ax[j].set_yticks([])
    fig.subplots_adjust(hspace=0.5)
The plots above demonstrate the impact of varying min_overlap and max_overlap. The middle series of plots shows that the difference between max_overlap and min_overlap plays an important role as well.
In the top row of plots, min_overlap=0.001 and max_overlap=0.001025. The small difference between max_overlap and min_overlap means that we are controlling cluster overlap rather tightly around 0.1%. Not only must no pair of clusters overlap more than about 0.1%, but each cluster must also overlap at least 0.1% with its closest neighbor. The bottom row paints a similar picture, except with more overlap between clusters (50% vs 0.1%).
The middle row shows a different scenario because we leave a substantial gap between min_overlap=0.001 and max_overlap=0.5. In this case, all clusters must overlap less than 50%, but we permit much smaller overlaps. This choice increases the variability of synthetic data sets because within the range of 0.001 to 0.5 we leave the actual overlaps to chance. Such variation may or may not be helpful for your application.
Cluster Aspect Ratios
Each cluster has an ellipsoidal shape that may be round like a ball, or long and slender like a rod. The aspect ratio of a cluster is the ratio of the length of its longest axis to the length of its shortest axis. In other words, a high aspect ratio indicates a long and slender cluster, whereas a low aspect ratio indicates a round cluster. Possible values for the aspect ratio range from 1 (a perfect sphere) to infinitely large.
When generating synthetic data using repliclust, you can influence the cluster aspect ratios by changing the parameters aspect_ref and aspect_maxmin. The reference aspect ratio, aspect_ref, determines the typical aspect ratio for all clusters in a synthetic data set. For example, if aspect_ref=3, the typical cluster is oblong with an aspect ratio of three. On the other hand, the max-min ratio aspect_maxmin determines the variability of cluster aspect ratios within the same data set. More precisely, aspect_maxmin is the ratio of the highest aspect ratio to the lowest aspect ratio in each data set. For example, if aspect_maxmin=3, then the aspect ratio of the “longest” cluster is three times that of the most “round” cluster.
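The two definitions can be spelled out in a few lines of plain Python. This is only an illustration of the definitions given here, not of how repliclust samples aspect ratios; the helper names aspect_ratio and aspect_maxmin_of are hypothetical and the axis lengths are made up.

```python
def aspect_ratio(axis_lengths):
    """Ratio of the longest to the shortest axis of one ellipsoidal cluster."""
    return max(axis_lengths) / min(axis_lengths)


def aspect_maxmin_of(clusters):
    """Ratio of the highest to the lowest aspect ratio in a data set."""
    ratios = [aspect_ratio(axes) for axes in clusters]
    return max(ratios) / min(ratios)


# Made-up axis lengths for three clusters in two dimensions.
clusters = [
    (2.0, 2.0),   # round: aspect ratio 1
    (3.0, 1.5),   # aspect ratio 2
    (6.0, 2.0),   # elongated: aspect ratio 3
]

print([aspect_ratio(axes) for axes in clusters])  # [1.0, 2.0, 3.0]
print(aspect_maxmin_of(clusters))                 # 3.0
```

In this toy data set the max-min ratio is 3 because the most elongated cluster has three times the aspect ratio of the roundest one.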
The simulation below demonstrates the effect of changing aspect_ref and aspect_maxmin.
import matplotlib.pyplot as plt
import repliclust

repliclust.set_seed(1)

# 2x2 grid: rows vary aspect_ref, columns vary aspect_maxmin.
fig, ax = plt.subplots(figsize=(8,8), dpi=300, nrows=2, ncols=2)

for i, aspect_ref in enumerate([1, 3]):
    for j, aspect_maxmin in enumerate([1, 3]):
        archetype = repliclust.Archetype(
            n_clusters=5, n_samples=750,
            aspect_ref=aspect_ref,
            aspect_maxmin=aspect_maxmin,
            radius_maxmin=1.0,
            min_overlap=0.04, max_overlap=0.05,
            distributions=['normal'])
        X, y, _ = (repliclust.DataGenerator(archetype)
                   .synthesize(quiet=True))
        ax[i,j].scatter(X[:,0], X[:,1],c=y, s=5,
                        alpha=0.5, linewidth=0.5)
        aspect_ref_description = (r"$\bf{Round~Shape}$" if (i==0)
                                  else r"$\bf{Long~Shape}$")
        aspect_maxmin_description = (
            r"$\bf{-~no~Variability}$" if (j==0)
            else r"$\bf{-~3x~Variability}$"
        )
        ax[i,j].set_title(
            aspect_ref_description + " "
            + aspect_maxmin_description + "\n"
            +r"$ aspect\_ref $=" + str(aspect_ref) + ", "
            +r"$ aspect\_maxmin $=" + str(aspect_maxmin),
            fontsize=10, y=1.05
        )
        # Equal axis scaling so that cluster shapes are not distorted.
        ax[i,j].set_aspect('equal')
        ax[i,j].set_xticks([]); ax[i,j].set_yticks([])

plt.subplots_adjust(hspace=0.3, wspace=0.15)
Cluster Volumes
The volume of a cluster is the volume spanned by the inner 75% of its probability mass. Since cluster volume grows rapidly in high dimensions, we quantify the spatial extent of a cluster in terms of its radius instead. The radius of an ellipsoidal cluster is the spherical radius of a ball with the same volume.
When generating synthetic data with repliclust, you can influence the variability in cluster volumes by changing the radius_maxmin parameter. This parameter sets the ratio between the largest and smallest cluster radii within a data set. For example, if radius_maxmin is 10 and the smallest cluster has unit radius, then the biggest cluster has a radius of 10. Note that volumes scale differently from radii. In dim dimensions, radius_maxmin=10 implies that the biggest cluster volume is 10**dim times greater than the smallest.
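The relation between radius and volume can be checked with a short calculation. The sketch below uses only the standard library; the helper names equivalent_radius and volume_ratio are hypothetical, and the semi-axis lengths are made up for illustration.

```python
import math


def equivalent_radius(semi_axes):
    """Radius of the ball with the same volume as the ellipsoid.

    An ellipsoid with semi-axes a_1, ..., a_d has volume c_d * a_1 * ... * a_d,
    and a ball of radius r has volume c_d * r**d with the same constant c_d,
    so r is the geometric mean of the semi-axes.
    """
    d = len(semi_axes)
    return math.prod(semi_axes) ** (1.0 / d)


def volume_ratio(radius_maxmin, dim):
    """Ratio of the largest to the smallest cluster volume."""
    return radius_maxmin ** dim


print(equivalent_radius([4.0, 1.0]))  # 2.0

# The same radius ratio implies very different volume ratios
# as the number of dimensions grows.
for dim in (2, 10, 50):
    print(dim, volume_ratio(10, dim))
```

This is why the radius, rather than the volume, is the more convenient quantity to control in high dimensions.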
The simulation below demonstrates the effect of varying radius_maxmin.
import repliclust
import matplotlib.pyplot as plt

repliclust.set_seed(1)

# One panel per value of radius_maxmin.
fig, ax = plt.subplots(figsize=(10,3.3), dpi=300,
                       nrows=1, ncols=3)

for i, radius_maxmin in enumerate([1,3,10]):
    archetype = repliclust.Archetype(
        radius_maxmin=radius_maxmin,
        max_overlap=0.05, min_overlap=0.04
    )
    X, y, _ = (repliclust.DataGenerator(archetype)
               .synthesize(quiet=True))
    description = (
        r"$\bf{Equal~Cluster~Volumes}$"
        if i==0
        else (r"$\bf{3x~Variability}$"
              if (i==1)
              else r"$\bf{10x~Variability}$")
    )
    ax[i].scatter(X[:,0], X[:,1], c=y, s=10, alpha=0.5,
                  linewidth=0.25, edgecolor='gray')
    ax[i].set_xticks([]); ax[i].set_yticks([])
    ax[i].set_title(description + '\n'
                    + r'$ radius\_maxmin $'
                    + " = " + str(radius_maxmin))
Cluster Probability Distributions
Each cluster consists of data points spread around a central point according to a probability distribution. While a cluster’s overall ellipsoidal shape depends on its covariance matrix, the choice of probability distribution determines how quickly the density of data points drops with increasing distance from the central point. For example, the normal distribution spreads all data points rather tightly around the central point. By contrast, the exponential distribution spreads the probability mass further out in space, leaving a larger share of data points away from the cluster center. Going even further, heavy-tailed distributions such as the standard t distribution with df=1 degrees of freedom give rise to outliers, data points very far from the cluster center.
When generating synthetic data using repliclust, you can use the distributions parameter to customize the probability distributions appearing in your synthetic data sets. As an example, the scatter plots below visualize the differences between the normal, exponential, and standard t distributions.
Note the vastly different scales of the X1 and X2 axes. On the left, the normal distribution keeps all data points within about two units of distance from the cluster center. On the right, the heavy-tailed standard t distribution leads to outliers as far as 200 units away. The exponential distribution in the middle strikes a compromise, with distances of up to about five units from the center.
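The difference in tail behavior can also be seen without plotting. The sketch below is a one-dimensional illustration using only the Python standard library, so it does not reproduce how repliclust places points around a cluster center; the helper quantile_of_abs is hypothetical. It draws samples from the three distributions and compares how far out the 99th percentile of the absolute values lies.

```python
import math
import random

rng = random.Random(0)  # fixed seed for reproducibility
n = 10_000


def quantile_of_abs(samples, q):
    """Empirical q-quantile of the absolute values of the samples."""
    ordered = sorted(abs(x) for x in samples)
    return ordered[int(q * (len(ordered) - 1))]


draws = {
    "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
    "exponential": [rng.expovariate(1.0) for _ in range(n)],
    # A standard t variable with df=1 is a Cauchy variable, which can be
    # sampled as tan(pi * (U - 0.5)) for U uniform on [0, 1).
    "standard_t (df=1)": [math.tan(math.pi * (rng.random() - 0.5))
                          for _ in range(n)],
}

for name, samples in draws.items():
    print(name, round(quantile_of_abs(samples, 0.99), 2))
```

The printed quantiles increase from the normal to the exponential to the standard t distribution, matching the ordering of the axis scales described above.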
Besides choosing a single probability distribution, you can use multiple distributions. This choice leads to synthetic data sets in which different clusters have different probability distributions. In general, the parameter distributions is a list containing the names of all probability distributions, as well as their parameters. Not all distributions have parameters. To obtain a list of the probability distributions currently supported in repliclust, as well as their parameters, call get_supported_distributions().
from repliclust import get_supported_distributions
get_supported_distributions()
{'normal': {},
'standard_t': {'df': 5},
'exponential': {},
'beta': {'a': 2.5, 'b': 8.5},
'chisquare': {'df': 5},
'gumbel': {},
'weibull': {'a': 1.5},
'gamma': {'shape': 3},
'pareto': {'a': 10},
'f': {'dfnum': 7, 'dfden': 10},
'lognormal': {'sigma': 0.75}}
It is important to spell the names of distributions exactly as shown above. All names are adapted from the numpy.random.Generator class. To understand the meaning of the distributional parameters, see the numpy documentation. For example, the documentation for numpy.random.Generator.gamma describes the gamma distribution.
When specifying a probability distribution with parameters, the corresponding entry in distributions should be a tuple (name, parameters), where name is the name of the distribution and parameters is a dictionary of distributional parameters. For example, the gamma distribution has parameters shape and scale. Below we generate synthetic data based on an archetype with gamma-distributed clusters. Note that in repliclust you can only change the parameters listed when calling get_supported_distributions(), even though the corresponding numpy class might have additional parameters. For example, the normal and exponential distributions have no parameters in repliclust.
The simulation below generates a synthetic data set with gamma-distributed clusters.
import repliclust
import matplotlib.pyplot as plt

repliclust.set_seed(1)

# A distribution with parameters is given as a (name, parameters) tuple.
my_archetype = repliclust.Archetype(
    min_overlap=0.01, max_overlap=0.05,
    distributions=[('gamma', {'shape': 1, 'scale': 2.0})]
)
X, y, _ = (repliclust.DataGenerator(my_archetype)
           .synthesize(quiet=True))

plt.scatter(X[:,0],X[:,1],c=y, s=20, alpha=0.5,
            linewidth=0.25, edgecolor='gray')
plt.gcf().set_dpi(300)
plt.gca().set_xticks([]); plt.gca().set_yticks([])
plt.title(r"$\bf{Gamma{-}Distributed~Clusters}$" + '\n'
          + r"$distributions=[('gamma', "
          + r"\{'shape': 1, 'scale': 2.0\})]$");
When using multiple distributions, repliclust randomly assigns a distribution to each cluster. For example, the choice distributions=['normal', 'exponential'] makes half of the clusters normally distributed, and the other half exponentially distributed. To customize these proportions, use the parameter distribution_proportions. For example, to raise the share of exponentially distributed clusters to 75%, set distribution_proportions=[0.25,0.75]. The simulation below demonstrates such possibilities in a more complex example.
import repliclust
import matplotlib.pyplot as plt

repliclust.set_seed(2)

# Mix plain names with (name, parameters) tuples; one proportion per entry.
distr_list = ['normal','exponential',
              ('gamma', {'shape': 1, 'scale': 2.0})]
distr_proportions = [0.25,0.5,0.25]

my_archetype = repliclust.Archetype(
    n_clusters=8,
    min_overlap=0.005, max_overlap=0.006,
    distributions=distr_list,
    distribution_proportions=distr_proportions
)
X, y, _ = (repliclust.DataGenerator(my_archetype)
           .synthesize(quiet=True))

plt.scatter(X[:,0],X[:,1],c=y,alpha=0.5,
            linewidth=0.25, edgecolor='gray')
plt.gcf().set_dpi(300)
# Use the current axes; this example has no ax array or loop index.
plt.gca().set_xticks([]); plt.gca().set_yticks([])
plt.title(r"$\bf{Using~Multiple~Probability~Distributions}$"
          + '\n' + r"$ distributions=['normal', 'exponential',"
          + r"('gamma', \{'shape': 1, 'scale': 2.0\})] $,"
          + '\n'
          + r"$ distribution\_proportions=[0.25,0.5,0.25] $",
          fontsize=10);
Can you spot which of the clusters above have normal, exponential, or gamma distributions?
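As a rough guide to what to look for, the proportions translate into expected cluster counts. The sketch below is plain Python with a hypothetical helper, expected_cluster_counts; it assumes that counts are simply proportional to distribution_proportions and that the proportions sum to one. Because repliclust assigns distributions randomly, the counts in any single data set may differ.

```python
def expected_cluster_counts(n_clusters, distributions, proportions):
    """Expected number of clusters per distribution, n_clusters * proportion."""
    if len(distributions) != len(proportions):
        raise ValueError("need one proportion per distribution")
    if abs(sum(proportions) - 1.0) > 1e-9:
        # Assumption of this sketch, not a documented repliclust rule.
        raise ValueError("proportions should sum to 1")
    # Entries are either a name or a (name, parameters) tuple.
    names = [d if isinstance(d, str) else d[0] for d in distributions]
    return {name: n_clusters * p for name, p in zip(names, proportions)}


distr_list = ['normal', 'exponential',
              ('gamma', {'shape': 1, 'scale': 2.0})]
print(expected_cluster_counts(8, distr_list, [0.25, 0.5, 0.25]))
# {'normal': 2.0, 'exponential': 4.0, 'gamma': 2.0}
```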
Group Sizes
The group size of a cluster is the number of data points in it. When group sizes vary significantly between clusters in the same data set, we speak of class imbalance. When generating synthetic data using repliclust, you can vary the class imbalance by specifying the imbalance_ratio. This parameter sets the ratio of the greatest to the smallest number of data points among all clusters in the same data set. For example, if imbalance_ratio=10, then the cluster with the most data points has ten times as many data points as the cluster with the fewest. By contrast, the total number of data points in the whole data set depends on the parameter n_samples introduced in the Basic Parameters section.
The simulation below demonstrates the effect of changing the imbalance_ratio.
import matplotlib.pyplot as plt
import repliclust

repliclust.set_seed(1)

# Left panel: balanced clusters; right panel: 10x imbalance.
fig, ax = plt.subplots(figsize=(10,5), dpi=300,
                       nrows=1, ncols=2)

for i, imbalance_ratio in enumerate([1, 10]):
    archetype = repliclust.Archetype(
        n_clusters=2, n_samples=120,
        distributions=['normal'],
        imbalance_ratio=imbalance_ratio)
    X, y, _ = (repliclust.DataGenerator(archetype)
               .synthesize(quiet=True))
    ax[i].scatter(X[:,0], X[:,1],c=y, alpha=0.5,
                  linewidth=0.25, edgecolor='gray')
    plot_description = (r"$\bf{Perfect~Balance}$" if (i==0)
                        else r"$\bf{10x~Imbalance}$")
    ax[i].set_title(plot_description
                    + "\n" +r"$ imbalance\_ratio $="
                    + str(imbalance_ratio))
    ax[i].set_xticks([]); ax[i].set_yticks([])
In the scatter plots above, both data sets have n_samples=120 data points. On the left, both clusters have the same number of data points (class balance). On the right, the bigger cluster has ten times as many data points as the smaller cluster (class imbalance).
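For two clusters, the group sizes implied by n_samples and imbalance_ratio follow from simple arithmetic. The sketch below uses a hypothetical helper, two_cluster_sizes, and is not repliclust's allocation routine, which may round differently or behave differently for more than two clusters.

```python
def two_cluster_sizes(n_samples, imbalance_ratio):
    """Split n_samples between two clusters with the given size ratio.

    Solves small + large = n_samples with large = imbalance_ratio * small,
    then rounds to whole data points, so the realized ratio is approximate.
    """
    small = round(n_samples / (1 + imbalance_ratio))
    large = n_samples - small
    return small, large


for ratio in (1, 10):
    small, large = two_cluster_sizes(120, ratio)
    print(ratio, (small, large), round(large / small, 2))
# 1 (60, 60) 1.0
# 10 (11, 109) 9.91
```

With 120 data points, an exact ratio of 10 is not attainable with whole numbers, so the realized ratio in this sketch is close to, but not exactly, ten.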