# Specifying an Archetype

In this section, we explain how to customize a data set archetype to obtain synthetic data that suits your needs.

## Basic Parameters

Basic parameters of each `Archetype` include the desired number of
clusters `n_clusters`, the number of dimensions `dim` of the data, the
desired total number of data points `n_samples` in each synthetic
dataset, and the name of the archetype (`name`).

## Overlaps Between Clusters

We quantify the overlap between any pair of clusters as a percentage. Roughly, an overlap of 0.05 indicates that the outer 5% of the clusters’ probability densities overlap.

In a data set with *k* clusters, there are *k(k-1)/2* pairs of clusters.
To quantify the desired overlap for the whole data set, you can use the
parameters `min_overlap` and `max_overlap`.
The latter parameter imposes an upper limit on the overlap between any
pair of clusters. Hence, decrease `max_overlap` if you want to ensure
that clusters are farther apart. On the other hand, `min_overlap` sets a
lower limit on the overlap between a cluster and its closest neighbor.
In other words, increase `min_overlap` if you want to avoid isolated
clusters. Choose similar values for `min_overlap` and `max_overlap` if
you would like to impose a consistent overlap across all synthetic data
sets. However, keep in mind that `max_overlap` must always exceed
`min_overlap`; in addition, when the gap between `min_overlap` and
`max_overlap` is too small, data generation may take unacceptably long.
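To make the pair count and the constraint on the two overlap parameters concrete, here is a minimal sketch in plain Python. The helpers `count_cluster_pairs` and `relative_overlap_gap` are hypothetical names introduced for illustration; they are not part of **repliclust**, and the check only encodes the rule stated above that `max_overlap` must exceed `min_overlap`.

```python
from itertools import combinations


def count_cluster_pairs(n_clusters):
    """Number of unordered cluster pairs, k(k-1)/2."""
    return n_clusters * (n_clusters - 1) // 2


def relative_overlap_gap(min_overlap, max_overlap):
    """Hypothetical sanity check (not a repliclust function).

    Raises if max_overlap does not exceed min_overlap; otherwise returns
    the gap between the two settings relative to min_overlap.
    """
    if not min_overlap < max_overlap:
        raise ValueError("max_overlap must exceed min_overlap")
    return (max_overlap - min_overlap) / min_overlap


# Cross-check the closed-form pair count by enumerating the pairs.
n_pairs_enumerated = len(list(combinations(range(5), 2)))
print(count_cluster_pairs(5), n_pairs_enumerated)

# A tight setting (small relative gap) versus a loose one.
print(relative_overlap_gap(1e-3, 1.025e-3))
print(relative_overlap_gap(1e-3, 0.5))
```

A small relative gap corresponds to tightly controlled overlaps, which is also the regime where data generation can become slow.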

The simulation below generates synthetic data sets for various choices of `min_overlap` and `max_overlap`. We discuss the results below.

```
from repliclust import set_seed, Archetype, DataGenerator
import matplotlib.pyplot as plt

set_seed(2)

# eps is the relative gap between min_overlap and max_overlap
eps = 0.025
overlap_settings = [
    {'min_overlap': 1e-3, 'max_overlap': (1+eps)*1e-3},
    {'min_overlap': 1e-3, 'max_overlap': 0.5},
    {'min_overlap': 0.5, 'max_overlap': (1+eps)*0.5}
]

# draw one row of four synthetic data sets per overlap setting
for i, overlaps in enumerate(overlap_settings):
    fig, ax = plt.subplots(
        figsize=(10,2), dpi=300, nrows=1, ncols=4)
    description = (
        r"$\bf{Cluster~Overlaps~around~0.1\%}$" if i==0
        else (r"$\bf{Cluster~Overlaps~"
              + r"between~0.1\%~and~50\%}$" if (i==1)
              else r"$\bf{Cluster~Overlaps~around~50\%}$")
    )
    fig.suptitle(description + '\n'
                 + "min_overlap" + r"$ \approx $"
                 + str(overlaps['min_overlap'])
                 + ", max_overlap" + r"$ \approx $"
                 + str(round(overlaps['max_overlap'],3)),
                 y=1.15)
    for j in range(4):
        archetype = Archetype(
            min_overlap=overlaps['min_overlap'],
            max_overlap=overlaps['max_overlap']
        )
        X, y, archetype = (DataGenerator(archetype)
                           .synthesize(quiet=True))
        ax[j].scatter(X[:,0], X[:,1], c=y, s=5,
                      alpha=0.5, linewidth=0.5)
        ax[j].set_xticks([]); ax[j].set_yticks([])
    fig.subplots_adjust(hspace=0.5)
```

The plots above demonstrate the impact of varying `min_overlap` and `max_overlap`. The middle series of plots shows that the difference between `max_overlap` and `min_overlap` plays an important role as well.

In the top row of plots, `min_overlap=0.001` and
`max_overlap=0.001025`. The small difference between `max_overlap`
and `min_overlap` means that we are controlling cluster overlap
rather tightly around 0.1%. Not only must no pair of clusters overlap
more than about 0.1%, but also each cluster
must overlap at least 0.1% with its closest neighbor. The bottom row
paints a similar picture, except with more overlap between clusters
(50% vs 0.1%).

The middle row shows a different scenario because we leave a
substantial gap between `min_overlap=0.001` and `max_overlap=0.5`.
In this case, all clusters must overlap less than 50%, but we permit
much smaller overlaps. This choice increases the variability of
synthetic data sets because within the range of 0.001 to 0.5 we leave
the actual overlaps to chance. Such variation may or may not be helpful
for your application.

## Cluster Aspect Ratios

Each cluster has an ellipsoidal shape that may be round like a ball, or
long and slender like a rod. The *aspect ratio* of a cluster is the
ratio of the length of its longest axis to the length of its shortest
axis. In other words, a high aspect ratio indicates a long and slender
cluster, whereas a low aspect ratio indicates a round cluster. Possible
values for the aspect ratio range from 1 (a perfect sphere) to
infinitely large.

When generating synthetic data using **repliclust**, you can influence
the cluster aspect ratios by changing the parameters
`aspect_ref` and `aspect_maxmin`.
The reference aspect ratio, `aspect_ref`, determines the typical aspect
ratio for all clusters in a synthetic data set. For example, if
`aspect_ref=3`, the typical cluster is oblong with an aspect ratio of
three. On the other hand, the max-min ratio `aspect_maxmin` determines
the variability of cluster aspect ratios within the same data set.
More precisely, `aspect_maxmin` is the ratio of the highest aspect ratio
to the lowest aspect ratio in each data set. For example, if
`aspect_maxmin=3`, then the aspect ratio of the “longest” cluster is
three times that of the most “round” cluster.
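The two definitions above can be checked on a toy example. The sketch below is plain Python and does not call **repliclust**; the helper names `aspect_ratio` and `max_min_ratio`, as well as the axis lengths, are made up for illustration.

```python
def aspect_ratio(axis_lengths):
    """Ratio of the longest to the shortest axis of one cluster."""
    return max(axis_lengths) / min(axis_lengths)


def max_min_ratio(values):
    """Ratio of the largest to the smallest value in a collection."""
    return max(values) / min(values)


# Hypothetical axis lengths of three 2D clusters.
clusters = {
    "ball": [2.0, 2.0],   # round: aspect ratio 1
    "oval": [3.0, 2.0],   # slightly elongated
    "rod": [6.0, 2.0],    # long and slender: aspect ratio 3
}

ratios = {name: aspect_ratio(axes) for name, axes in clusters.items()}
print(ratios)

# Variability of aspect ratios within this toy data set,
# which is the quantity that aspect_maxmin describes.
print(max_min_ratio(ratios.values()))
```

In this toy data set the most elongated cluster has an aspect ratio three times that of the roundest one, which is the situation `aspect_maxmin=3` describes.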

The simulation below demonstrates the effect of changing `aspect_ref` and `aspect_maxmin`.

```
import matplotlib.pyplot as plt
import repliclust

repliclust.set_seed(1)

# rows vary aspect_ref, columns vary aspect_maxmin
fig, ax = plt.subplots(figsize=(8,8), dpi=300, nrows=2, ncols=2)
for i, aspect_ref in enumerate([1, 3]):
    for j, aspect_maxmin in enumerate([1, 3]):
        archetype = repliclust.Archetype(
            n_clusters=5, n_samples=750,
            aspect_ref=aspect_ref,
            aspect_maxmin=aspect_maxmin,
            radius_maxmin=1.0,
            min_overlap=0.04, max_overlap=0.05,
            distributions=['normal'])
        X, y, _ = (repliclust.DataGenerator(archetype)
                   .synthesize(quiet=True))
        ax[i,j].scatter(X[:,0], X[:,1], c=y, s=5,
                        alpha=0.5, linewidth=0.5)
        aspect_ref_description = (r"$\bf{Round~Shape}$" if (i==0)
                                  else r"$\bf{Long~Shape}$")
        aspect_maxmin_description = (
            r"$\bf{-~no~Variability}$" if (j==0)
            else r"$\bf{-~3x~Variability}$"
        )
        ax[i,j].set_title(
            aspect_ref_description + " "
            + aspect_maxmin_description + "\n"
            + r"$ aspect\_ref $=" + str(aspect_ref) + ", "
            + r"$ aspect\_maxmin $=" + str(aspect_maxmin),
            fontsize=10, y=1.05
        )
        # equal axis scaling so that cluster shapes are not distorted
        ax[i,j].set_aspect('equal')
        ax[i,j].set_xticks([]); ax[i,j].set_yticks([])
plt.subplots_adjust(hspace=0.3, wspace=0.15)
```

## Cluster Volumes

The volume of a cluster is the volume spanned by the inner 75% of its probability mass. Since cluster volume grows rapidly in high dimensions, we quantify the spatial extent of a cluster in terms of its radius instead. The radius of an ellipsoidal cluster is the spherical radius of a ball with the same volume.

When generating synthetic data with **repliclust**, you can influence
the variability in cluster volumes by changing the
`radius_maxmin` parameter. This parameter sets the ratio between the
largest and smallest cluster radii within a data set. For example, if
`radius_maxmin` is 10 and the smallest cluster has unit radius, then the
biggest cluster has a radius of 10. Note that volumes scale
differently from radii. In *dim* dimensions, `radius_maxmin=10`
implies that the biggest cluster volume is `10**dim` times
greater than the smallest.
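The relationship between radii and volumes can be worked through numerically. The sketch below is plain Python with hypothetical helper names (`equivalent_radius`, `volume_ratio`); it relies only on the fact that an ellipsoid with semi-axes a_1, ..., a_dim has the same volume as a ball whose radius is the geometric mean of those semi-axes.

```python
import math


def equivalent_radius(semi_axes):
    """Radius of the ball with the same volume as an ellipsoid.

    The ellipsoid volume is proportional to the product of its semi-axes,
    and the ball volume to radius**dim, so the radius is the geometric
    mean of the semi-axes.
    """
    dim = len(semi_axes)
    return math.prod(semi_axes) ** (1.0 / dim)


def volume_ratio(radius_maxmin, dim):
    """Ratio of the largest to the smallest cluster volume."""
    return radius_maxmin ** dim


# A 2D ellipse with semi-axes 4 and 1 covers the same area as a disk
# of radius 2.
print(equivalent_radius([4.0, 1.0]))

# The same radius ratio implies very different volume ratios
# as the number of dimensions grows.
for dim in (2, 3, 10):
    print(dim, volume_ratio(10, dim))
```

This rapid growth of the volume ratio with `dim` is the reason for working with radii rather than volumes.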

The simulation below demonstrates the effect of varying
`radius_maxmin`.

```
import repliclust
import matplotlib.pyplot as plt

repliclust.set_seed(1)

# one panel per value of radius_maxmin
fig, ax = plt.subplots(figsize=(10,3.3), dpi=300,
                       nrows=1, ncols=3)
for i, radius_maxmin in enumerate([1,3,10]):
    archetype = repliclust.Archetype(
        radius_maxmin=radius_maxmin,
        max_overlap=0.05, min_overlap=0.04
    )
    X, y, _ = (repliclust.DataGenerator(archetype)
               .synthesize(quiet=True))
    description = (
        r"$\bf{Equal~Cluster~Volumes}$"
        if i==0
        else (r"$\bf{3x~Variability}$"
              if (i==1)
              else r"$\bf{10x~Variability}$")
    )
    ax[i].scatter(X[:,0], X[:,1], c=y, s=10, alpha=0.5,
                  linewidth=0.25, edgecolor='gray')
    ax[i].set_xticks([]); ax[i].set_yticks([])
    ax[i].set_title(description + '\n'
                    + r'$ radius\_maxmin $'
                    + " = " + str(radius_maxmin))
```

## Cluster Probability Distributions

Each cluster consists of data points spread around a central point
according to a probability distribution. While a cluster’s overall
ellipsoidal shape depends on its covariance matrix, the choice of
probability distribution determines how quickly the density of data
points drops with increasing
distance from the central point. For example, the normal
distribution spreads all data points rather tightly around the central
point. By contrast, the exponential
distribution spreads the probability mass further out in space, leaving
a larger share of data points away from the cluster center.
Going even further, heavy-tailed distributions such as the
standard t distribution with `df=1` degrees of freedom give rise to
*outliers*, data points very far from the cluster center.

When generating synthetic data using **repliclust**, you can use the
`distributions` parameter to customize the probability distributions
appearing in your synthetic data sets. As an example, the scatter plots
below visualize the differences between the normal,
exponential, and standard t distributions.

Note the vastly different scales of the X1 and X2 axes. On the left, the normal distribution keeps all data points within about two units of distance from the cluster center. On the right, the heavy-tailed standard t distribution leads to outliers as far as 200 units away. The exponential distribution in the middle strikes a compromise, with distances of up to about five units from the center.
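The difference in tail behavior can also be seen without plotting. The sketch below is a one-dimensional illustration in plain Python and is not how **repliclust** samples clusters: for each distribution it computes the distance from the center within which 99.9% of samples fall. The function name `distance_quantile` and the label `"standard_t_df1"` are hypothetical; the formulas are the standard quantile functions of the standard normal, the unit-rate exponential, and the standard t distribution with one degree of freedom (the Cauchy distribution).

```python
import math
from statistics import NormalDist


def distance_quantile(name, p):
    """Distance d such that a fraction p of 1-D samples lies within d."""
    if name == "normal":
        # P(|Z| <= d) = p  <=>  Phi(d) = 0.5 + p/2
        return NormalDist().inv_cdf(0.5 + p / 2)
    if name == "exponential":
        # P(X <= d) = 1 - exp(-d) = p
        return -math.log(1 - p)
    if name == "standard_t_df1":
        # P(|T| <= d) = (2/pi) * atan(d) = p
        return math.tan(math.pi * p / 2)
    raise ValueError(f"unknown distribution: {name!r}")


for name in ("normal", "exponential", "standard_t_df1"):
    print(name, round(distance_quantile(name, 0.999), 2))
```

The ordering of the three distances mirrors the ordering of the axis scales described above: tightest for the normal, intermediate for the exponential, and far larger for the heavy-tailed t distribution.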

Besides choosing a single probability distribution, you can use multiple
distributions. This choice leads to synthetic
data sets in which different clusters have different probability
distributions. In general, the parameter `distributions` is a list
containing the names of all probability distributions, as well as their
parameters. Not all distributions have parameters. To obtain a list of
the probability distributions currently supported in **repliclust**, as
well as their parameters, call `get_supported_distributions()`.

```
from repliclust import get_supported_distributions
get_supported_distributions()
```

```
{'normal': {},
'standard_t': {'df': 5},
'exponential': {},
'beta': {'a': 2.5, 'b': 8.5},
'chisquare': {'df': 5},
'gumbel': {},
'weibull': {'a': 1.5},
'gamma': {'shape': 3},
'pareto': {'a': 10},
'f': {'dfnum': 7, 'dfden': 10},
'lognormal': {'sigma': 0.75}}
```

It is important to
spell the names of distributions exactly as shown above. All names are
adapted from the `numpy.random.Generator` class. To understand the
meaning of the distributional parameters, see the `numpy`
documentation. For example, see the documentation of
`numpy.random.Generator.gamma` for the gamma distribution.

When specifying a probability distribution with parameters, the
corresponding entry in `distributions` should be a tuple
*(name, parameters)*, where *name* is the name of the distribution and
*parameters* is a dictionary of distributional parameters. For example,
the gamma distribution has parameters `shape` and `scale`. Below
we generate synthetic data based on an archetype with gamma-distributed
clusters. Note that in **repliclust** you can only change the parameters
listed when calling `get_supported_distributions()`,
even though the corresponding `numpy` class might have additional
parameters. For example, the normal and exponential distributions have
no parameters in **repliclust**.
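Because entries of `distributions` can be either a bare name or a *(name, parameters)* tuple, it can help to see both forms brought into one shape. The sketch below is plain Python; `normalize_distributions` is a hypothetical helper, not a **repliclust** function, and the set of names is copied from the output of `get_supported_distributions()` shown above. It checks only the spelling of names, not the parameters.

```python
# Names copied from the get_supported_distributions() output above.
SUPPORTED_NAMES = {
    "normal", "standard_t", "exponential", "beta", "chisquare",
    "gumbel", "weibull", "gamma", "pareto", "f", "lognormal",
}


def normalize_distributions(distributions):
    """Turn every entry into a (name, parameters) tuple.

    Hypothetical helper for illustration: a bare string becomes
    (name, {}), a tuple is kept, and misspelled names raise an error.
    """
    normalized = []
    for entry in distributions:
        if isinstance(entry, str):
            name, params = entry, {}
        else:
            name, params = entry
        if name not in SUPPORTED_NAMES:
            raise ValueError(f"unknown distribution name: {name!r}")
        normalized.append((name, dict(params)))
    return normalized


print(normalize_distributions(
    ["normal", ("gamma", {"shape": 1, "scale": 2.0})]
))
```

Names are case-sensitive in this sketch, so an entry such as `'Normal'` is rejected, in line with the advice to spell names exactly as listed.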

The simulation below generates a synthetic data set with gamma-distributed clusters.

```
import repliclust
import matplotlib.pyplot as plt

repliclust.set_seed(1)

# a parameterized distribution is passed as a (name, parameters) tuple
my_archetype = repliclust.Archetype(
    min_overlap=0.01, max_overlap=0.05,
    distributions=[('gamma', {'shape': 1, 'scale': 2.0})]
)
X, y, _ = (repliclust.DataGenerator(my_archetype)
           .synthesize(quiet=True))
plt.scatter(X[:,0], X[:,1], c=y, s=20, alpha=0.5,
            linewidth=0.25, edgecolor='gray')
plt.gcf().set_dpi(300)
plt.gca().set_xticks([]); plt.gca().set_yticks([])
plt.title(r"$\bf{Gamma{-}Distributed~Clusters}$" + '\n'
          + r"$distributions=[('gamma', "
          + r"\{'shape': 1, 'scale': 2.0\})]$");
```

When using multiple distributions, **repliclust**
randomly assigns a distribution to each cluster. For example, the
choice `distributions=['normal', 'exponential']` makes half of the
clusters normally distributed, and the other half exponentially
distributed. To customize these proportions, use the parameter
`distribution_proportions`. For example, to raise the share of
exponentially distributed clusters to 75%, set
`distribution_proportions=[0.25,0.75]`. The simulation below
demonstrates such possibilities in a more complex example.

```
import repliclust
import matplotlib.pyplot as plt

repliclust.set_seed(2)

# mix bare names with a parameterized (name, parameters) entry
distr_list = ['normal', 'exponential',
              ('gamma', {'shape': 1, 'scale': 2.0})]
distr_proportions = [0.25, 0.5, 0.25]
my_archetype = repliclust.Archetype(
    n_clusters=8,
    min_overlap=0.005, max_overlap=0.006,
    distributions=distr_list,
    distribution_proportions=distr_proportions
)
X, y, _ = (repliclust.DataGenerator(my_archetype)
           .synthesize(quiet=True))
plt.scatter(X[:,0], X[:,1], c=y, alpha=0.5,
            linewidth=0.25, edgecolor='gray')
plt.gcf().set_dpi(300)
# single plot, so hide the ticks on the current axes
plt.gca().set_xticks([]); plt.gca().set_yticks([])
plt.title(r"$\bf{Using~Multiple~Probability~Distributions}$"
          + '\n' + r"$ distributions=['normal', 'exponential',"
          + r"('gamma', \{'shape': 1, 'scale': 2.0\})] $,"
          + '\n'
          + r"$ distribution\_proportions=[0.25,0.5,0.25] $",
          fontsize=10);
```

Can you spot which of the clusters above have normal, exponential, or gamma distributions?
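As a rough guide to what to expect in the example above, the proportions can be turned into expected cluster counts. The sketch below is plain Python with hypothetical helper names; the weighted random draw is only one plausible way to assign distributions and is an assumption, not a description of how **repliclust** does it internally.

```python
import random


def expected_cluster_counts(n_clusters, names, proportions):
    """Expected number of clusters per distribution name."""
    return {name: n_clusters * p for name, p in zip(names, proportions)}


def sketch_assignment(n_clusters, names, proportions, seed=0):
    """Assumed mechanism: draw one name per cluster with the given weights."""
    rng = random.Random(seed)
    return rng.choices(names, weights=proportions, k=n_clusters)


names = ["normal", "exponential", "gamma"]
proportions = [0.25, 0.5, 0.25]

# With 8 clusters, the proportions correspond to 2, 4, and 2 clusters.
print(expected_cluster_counts(8, names, proportions))

# One random assignment; the realized counts may deviate from the
# expected ones under this assumed mechanism.
print(sketch_assignment(8, names, proportions))
```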

## Group Sizes

The *group size* of a cluster is the number of data points in it. When
group sizes vary significantly between clusters in the same data set, we
speak of *class imbalance*. When generating synthetic data using
**repliclust**, you can vary the class imbalance by specifying the
`imbalance_ratio`. This parameter sets the ratio of the greatest to
the smallest number of data points among all clusters in the same data
set. For example, if `imbalance_ratio=10` then the cluster with the
most data points has ten times as many data points as the cluster with
the fewest data points. By contrast, the total number of
data points in the whole data set depends on the parameter `n_samples`
introduced in the Basic Parameters section.

The simulation below demonstrates the effect of changing the
`imbalance_ratio`.

```
import matplotlib.pyplot as plt
import repliclust

repliclust.set_seed(1)

# left panel: balanced clusters, right panel: 10x imbalance
fig, ax = plt.subplots(figsize=(10,5), dpi=300,
                       nrows=1, ncols=2)
for i, imbalance_ratio in enumerate([1, 10]):
    archetype = repliclust.Archetype(
        n_clusters=2, n_samples=120,
        distributions=['normal'],
        imbalance_ratio=imbalance_ratio)
    X, y, _ = (repliclust.DataGenerator(archetype)
               .synthesize(quiet=True))
    ax[i].scatter(X[:,0], X[:,1], c=y, alpha=0.5,
                  linewidth=0.25, edgecolor='gray')
    plot_description = (r"$\bf{Perfect~Balance}$" if (i==0)
                        else r"$\bf{10x~Imbalance}$")
    ax[i].set_title(plot_description
                    + "\n" + r"$ imbalance\_ratio $="
                    + str(imbalance_ratio))
    ax[i].set_xticks([]); ax[i].set_yticks([])
```

In the scatter plots above, both datasets have `n_samples=120`
data points. On the left, both clusters have the same number of data
points (class balance). On the right, the bigger cluster has about ten
times as many data points as the smaller cluster (class imbalance).
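The arithmetic behind such group sizes can be sketched in plain Python. The function `sketch_group_sizes` below is hypothetical and rests on two assumptions that are not taken from **repliclust**: relative group sizes are spaced geometrically between 1 and `imbalance_ratio`, and fractional sizes are rounded with a largest-remainder rule so that the total equals `n_samples`.

```python
import math


def sketch_group_sizes(n_clusters, n_samples, imbalance_ratio):
    """Integer group sizes whose max/min ratio is close to imbalance_ratio.

    Assumed scheme for illustration only, not the repliclust algorithm.
    """
    if n_clusters == 1:
        return [n_samples]
    # Relative sizes from 1 (smallest) to imbalance_ratio (largest).
    weights = [imbalance_ratio ** (i / (n_clusters - 1))
               for i in range(n_clusters)]
    total = sum(weights)
    exact = [n_samples * w / total for w in weights]
    sizes = [math.floor(x) for x in exact]
    # Hand out the remaining points to the largest fractional parts.
    leftover = n_samples - sum(sizes)
    by_remainder = sorted(range(n_clusters),
                          key=lambda i: exact[i] - sizes[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


balanced = sketch_group_sizes(2, 120, 1)
imbalanced = sketch_group_sizes(2, 120, 10)
print(balanced, imbalanced)
print(max(imbalanced) / min(imbalanced))
```

Since 120 points cannot be split into two integer groups with a ratio of exactly 10, the realized ratio in this sketch is only approximately equal to `imbalance_ratio`.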