```
[1]:
```

```
import numpy as np
import matplotlib.pyplot as plt
import rsatoolbox
```

# Demo on unbalanced designs

In this demo we look at calculating RDMs from data that either contains different numbers of measurements per stimulus or where the measurements for different stimuli do not cover the same measurement channels. Such data is common in neural recordings, where individual neurons may become unmeasurable during an experiment, so that we do not have measurements of all neurons for all stimuli. Common reasons for different numbers of repetitions per stimulus are trials that have to be removed due to artifacts, aborted experiments, or designs where the experimenter did not have full control over which stimuli were shown.

First we need some data to work on. This data is deliberately small scale, so that we can follow everything that happens. As balanced data we use two repetitions of three different stimuli, measured on 5 measurement channels.

```
[2]:
```

```
data_balanced_array = np.array([
[0.7, 0.8, 0.9, 1.0, 1.1],
[0.2, 1.8, 2.9, 1.0, 1.3],
[2.7, 0.8, 0.2, 1.2, 1.1],
[1.7, 0.5, 0.9, 1.5, 1.1],
[1.7, 2.8, 2.2, 1.2, 1.0],
[1.7, 0.5, 0.4, 1.4, 0.3],
])
runs = [0, 1, 0, 1, 0, 1]
stimulus = [0, 0, 1, 1, 2, 2]
descriptors = {'runs': runs, 'stimulus': stimulus}
data_balanced = rsatoolbox.data.Dataset(data_balanced_array, obs_descriptors=descriptors)
```

For the unbalanced data we can use similar values, but let’s assume that we now measured stimulus 0 only once and stimulus 1 three times. Additionally, on two occasions we had to discard the measurement from a single channel due to technical problems:

```
[3]:
```

```
data_unbalanced_array = np.array([
[0.7, 0.8, 0.9, 1.0, 1.1],
[0.2, 1.8, np.nan, 1.0, 1.3],
[2.7, 0.8, 0.2, 1.2, 1.1],
[1.7, 0.5, 0.9, 1.5, 1.1],
[1.7, 2.8, 2.2, 1.2, np.nan],
[1.7, 0.5, 0.4, 1.4, 0.3],
])
runs = [0, 0, 1, 2, 0, 1]
stimulus = [0, 1, 1, 1, 2, 2]
descriptors = {'runs': runs, 'stimulus': stimulus}
data_unbalanced = rsatoolbox.data.Dataset(data_unbalanced_array, obs_descriptors=descriptors)
```

For the balanced data we can use the normal functions to get an RDM:

```
[4]:
```

```
balanced_rdm = rsatoolbox.rdm.calc_rdm(data_balanced, descriptor='stimulus')
print(balanced_rdm)
```

```
rsatoolbox.rdm.RDMs
1 RDM(s) over 3 conditions
dissimilarity_measure =
squared euclidean
dissimilarities[0] =
[[0. 1.088 0.4875]
[1.088 0. 0.4035]
[0.4875 0.4035 0. ]]
descriptors:
rdm_descriptors:
index = [0]
pattern_descriptors:
index = [0, 1, 2]
stimulus = [0, 1, 2]
```

For the unbalanced data this does not work, because the missing values break the calculation. The different numbers of measurements are also disregarded when we use the normal `calc_rdm` function, which averages the stimulus responses first:

```
[5]:
```

```
unbalanced_rdm_broken = rsatoolbox.rdm.calc_rdm(data_unbalanced, descriptor='stimulus')
print(unbalanced_rdm_broken)
```

```
rsatoolbox.rdm.RDMs
1 RDM(s) over 3 conditions
dissimilarity_measure =
squared euclidean
dissimilarities[0] =
[[ 0. nan nan]
[nan 0. nan]
[nan nan 0.]]
descriptors:
rdm_descriptors:
index = [0]
pattern_descriptors:
index = [0, 1, 2]
stimulus = [0, 1, 2]
```

Instead, we can use the slightly slower variant for unbalanced designs, which calculates the RDM correctly:

```
[6]:
```

```
unbalanced_rdm = rsatoolbox.rdm.calc_rdm_unbalanced(data_unbalanced, descriptor='stimulus')
print(unbalanced_rdm)
```

```
rsatoolbox.rdm.RDMs
1 RDM(s) over 3 conditions
dissimilarity_measure =
euclidean
dissimilarities[0] =
[[0. 0.24371429 0.6645098 ]
[0.24371429 0. 0.41717647]
[0.6645098 0.41717647 0. ]]
descriptors:
rdm_descriptors:
index = [0]
weights = [[14, 9, 25]]
pattern_descriptors:
index = [0, 1, 2]
stimulus = [0, 1, 2]
```

This RDM now contains valid values for the dissimilarities. Additionally, it gained an rdm_descriptor called ‘weights’, which contains the weight carried by each dissimilarity. The differences here are caused by the different numbers of measurement channels and repetitions for the different stimuli.
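One plausible way to reproduce these weight values (an illustrative re-derivation of what the numbers appear to count, not the toolbox’s internal code) is to count, for each pair of stimuli, the (repetition-pair, channel) products in which neither measurement is NaN:

```python
import numpy as np

# Illustrative re-derivation of the 'weights' descriptor (an assumption
# about its meaning, not the toolbox's internal code): each dissimilarity's
# weight counts the (repetition-pair, channel) products entering its
# average, i.e. those where neither of the two measurements is NaN.
data = np.array([
    [0.7, 0.8, 0.9, 1.0, 1.1],
    [0.2, 1.8, np.nan, 1.0, 1.3],
    [2.7, 0.8, 0.2, 1.2, 1.1],
    [1.7, 0.5, 0.9, 1.5, 1.1],
    [1.7, 2.8, 2.2, 1.2, np.nan],
    [1.7, 0.5, 0.4, 1.4, 0.3],
])
stimulus = np.array([0, 1, 1, 1, 2, 2])

def count_valid_products(s_a, s_b):
    valid_a = ~np.isnan(data[stimulus == s_a])  # reps_a x channels
    valid_b = ~np.isnan(data[stimulus == s_b])  # reps_b x channels
    # per channel: number of repetition pairs where both values are present
    return int(np.sum(valid_a.sum(axis=0) * valid_b.sum(axis=0)))

weights = [count_valid_products(a, b) for a, b in [(0, 1), (0, 2), (1, 2)]]
print(weights)  # [14, 9, 25], matching the weights descriptor above
```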

As a sanity check, we can compute the RDM from the balanced data with this method as well, to confirm that it yields the same RDM:

```
[7]:
```

```
sanity_rdm = rsatoolbox.rdm.calc_rdm_unbalanced(data_balanced, descriptor='stimulus')
print(sanity_rdm)
print('The differences between the two methods for the three dissimilarities are:')
print(sanity_rdm.get_vectors() - balanced_rdm.get_vectors())
```

```
rsatoolbox.rdm.RDMs
1 RDM(s) over 3 conditions
dissimilarity_measure =
euclidean
dissimilarities[0] =
[[0. 1.088 0.4875]
[1.088 0. 0.4035]
[0.4875 0.4035 0. ]]
descriptors:
rdm_descriptors:
index = [0]
weights = [[20, 20, 20]]
pattern_descriptors:
index = [0, 1, 2]
stimulus = [0, 1, 2]
The differences between the two methods for the three dissimilarities are:
[[ 2.22044605e-16 1.49880108e-15 -9.99200722e-16]]
```

The two RDMs are indeed identical up to normal floating point errors.

This concludes the basic introduction to calculating RDMs for unbalanced experiments.

## Implementation explanation

You may ask yourself why there are separate functions for computing RDMs for balanced and unbalanced designs. The main reason is that the functions for balanced designs first average the different measurements for the same stimulus and then calculate dissimilarities between these average activations. Computing this average requires that the same channels were measured for all repetitions of a stimulus, so the computation breaks down when measurements are missing.

To arrive at a compatible method for computing the dissimilarities, we decompose the formula into a sum of kernel evaluations, which we can average correctly. As an example, this is the decomposition for the euclidean distance \(d_{xy}\) of two patterns with measured representations \(x_i\) and \(y_j\) for \(i = 1 \dots N_x, j = 1 \dots N_y\), denoting the mean representations as \(\bar{x}\) and \(\bar{y}\), for a single measurement channel:
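As a sketch of this decomposition (reconstructed from the definitions above, so the notation may not match the original exactly), expanding the squared difference of the two means gives three averages of pairwise products:

```
d_{xy} = (\bar{x} - \bar{y})^2
       = \bar{x}^2 - 2\,\bar{x}\,\bar{y} + \bar{y}^2
       = \frac{1}{N_x^2}\sum_{i=1}^{N_x}\sum_{i'=1}^{N_x} x_i x_{i'}
         \;-\; \frac{2}{N_x N_y}\sum_{i=1}^{N_x}\sum_{j=1}^{N_y} x_i y_j
         \;+\; \frac{1}{N_y^2}\sum_{j=1}^{N_y}\sum_{j'=1}^{N_y} y_j y_{j'}
```

Each term is an average of kernel evaluations, i.e. products of two single measurements, which is what allows each product to be included or dropped individually.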

To compute the dissimilarities for more than one measurement channel, we simply compute this value per channel and average.

As we have now decomposed the formula into separate averages over individual stimulus measurements, we can deal with missing values by simply removing them from the averages.
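A minimal sketch of this idea for a single channel (illustrative names and code, not the rsatoolbox implementation): drop the missing values from each pattern and average the three kernel terms separately.

```python
import numpy as np

# Sketch of the decomposed squared euclidean distance for one channel:
# remove NaNs from each pattern's measurements, then average each of the
# three product terms (x*x', x*y, y*y') on its own.
def decomposed_sq_euclidean(x, y):
    x = x[~np.isnan(x)]
    y = y[~np.isnan(y)]
    xx = np.mean(np.outer(x, x))  # average over all x_i * x_i' products
    yy = np.mean(np.outer(y, y))  # average over all y_j * y_j' products
    xy = np.mean(np.outer(x, y))  # average over all cross products x_i * y_j
    return xx - 2 * xy + yy

x = np.array([0.7, np.nan, 2.7])
y = np.array([1.7, 1.0])
direct = (np.nanmean(x) - np.mean(y)) ** 2  # squared distance of the means
print(np.isclose(decomposed_sq_euclidean(x, y), direct))  # True
```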

The derivation for the Mahalanobis distance is very similar, except that the precision matrix of the noise enters the products.

### Crossvalidation

For crossvalidation, we only want to use products of differences from different runs. For the standard balanced version we get the following formulas:
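As a sketch of what such a formula looks like (reconstructed for a single channel, with \(x_a, y_a\) denoting the measurements in run \(a\); not necessarily the source’s exact notation), the crossvalidated squared distance averages products of pattern differences taken from distinct runs:

```
d_{xy}^{\mathrm{cv}}
  = \frac{1}{N_{\mathrm{pairs}}} \sum_{a \neq b} (x_a - y_a)(x_b - y_b)
  = \frac{1}{N_{\mathrm{pairs}}} \sum_{a \neq b}
      \left( x_a x_b - x_a y_b - y_a x_b + y_b y_a \right)
```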

Thus, we can again split the calculation into sums over the individual pairs of measurements, which allows us to deal with missing measurements gracefully. The only addition we need is to remove from the sum those terms that come from the same run/partition. This also extends completely to the Mahalanobis/CrossNobis distances.

### Poisson-KL based distances

When using the Kullback–Leibler divergence between Poisson distributions as a dissimilarity, this decomposition results in a different estimate than the original formulas, as derived here:
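As a sketch of the mismatch (a reconstruction up to constant factors, not the source’s exact derivation): for a single channel, the balanced estimate plugs the mean rates into the symmetrized Poisson KL divergence, while the decomposed estimate averages the kernel over pairs of single measurements,

```
d_{\mathrm{balanced}} = (\bar{x} - \bar{y})\,(\log \bar{x} - \log \bar{y}),
\qquad
\hat{d}_{\mathrm{decomposed}}
  = \frac{1}{N_x N_y} \sum_{i,j} (x_i - y_j)\,(\log x_i - \log y_j)
```

and these two differ whenever the log of a mean differs from the mean of the corresponding logs.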

That is, the problem is ultimately that the logarithm is non-linear, so that taking the mean of logs differs from the log of the mean.
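This non-linearity is easy to see numerically (a minimal illustration, not tied to the toolbox):

```python
import numpy as np

# Because the logarithm is concave, the mean of logs is at most the log of
# the mean (Jensen's inequality), with equality only for constant values.
rates = np.array([1.0, 2.0, 4.0])
mean_of_logs = np.mean(np.log(rates))  # mean of the individual logs
log_of_mean = np.log(np.mean(rates))   # log of the averaged rate
print(mean_of_logs < log_of_mean)  # True
```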

We nonetheless implement the decomposed estimator as an analog of the euclidean distances and think it is a sensible estimate, but it differs from the estimate based on the mean firing rates. Thus, for the `poisson` and `poisson_cv` distance estimation, `calc_rdm` and `calc_rdm_unbalanced` give different results:

```
[8]:
```

```
balanced_rdm_poisson = rsatoolbox.rdm.calc_rdm(data_balanced, descriptor='stimulus', method='poisson')
sanity_rdm_poisson = rsatoolbox.rdm.calc_rdm_unbalanced(data_balanced, descriptor='stimulus', method='poisson')
print('calc_rdm result:')
print(balanced_rdm_poisson)
print('calc_rdm_unbalanced result:')
print(sanity_rdm_poisson)
print('The differences between the two methods for the three dissimilarities are:')
print(sanity_rdm_poisson.get_vectors() - balanced_rdm_poisson.get_vectors())
```

```
calc_rdm result:
rsatoolbox.rdm.RDMs
1 RDM(s) over 3 conditions
dissimilarity_measure =
poisson
dissimilarities[0] =
[[0. 0.82390993 0.39072889]
[0.82390993 0. 0.31973757]
[0.39072889 0.31973757 0. ]]
descriptors:
rdm_descriptors:
index = [0]
pattern_descriptors:
index = [0, 1, 2]
stimulus = [0, 1, 2]
calc_rdm_unbalanced result:
rsatoolbox.rdm.RDMs
1 RDM(s) over 3 conditions
dissimilarity_measure =
poisson
dissimilarities[0] =
[[0. 0.8536975 0.428618 ]
[0.8536975 0. 0.26686784]
[0.428618 0.26686784 0. ]]
descriptors:
rdm_descriptors:
index = [0]
weights = [[20, 20, 20]]
pattern_descriptors:
index = [0, 1, 2]
stimulus = [0, 1, 2]
The differences between the two methods for the three dissimilarities are:
[[ 0.02978758 0.03788911 -0.05286973]]
```