rsatoolbox.data.dataset module

Definition of RSA Dataset class and subclasses

@author: baihan, jdiedrichsen, bpeters, adkipnis

class rsatoolbox.data.dataset.Dataset(measurements, descriptors=None, obs_descriptors=None, channel_descriptors=None, check_dims=True)[source]

Bases: DatasetBase

Dataset class is a standard version of DatasetBase. It contains one data set, or multiple data sets with the same structure.

get_measurements()[source]

Getter function for measurements

get_measurements_tensor(by)[source]

Returns a tensor version of the measurements array, split by an observation descriptor. This procedure will keep the order of measurements the same as it is in the dataset.

Parameters

by (String) – the descriptor by which the splitting is made

Returns

n_obs_rest x n_channel x n_obs_by 3d-array, where n_obs_by is the number of unique values that the obs_descriptor “by” takes, and n_obs_rest is the remaining number of observations per unique instance of “by”

Return type

measurements_tensor (numpy.ndarray)

nested_odd_even_split(l1_obs_desc, l2_obs_desc)[source]

Nested version of odd_even_split, where dataset is first partitioned according to the l1_obs_desc and each partition is again partitioned according to the l2_obs_desc (after which the actual oe-split occurs).

Useful for balancing, especially if the order of your measurements is inconsistent, or if the two descriptors are not orthogonalized. It’s advised to apply .sort_by(l2_obs_desc) to the output of this function.

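Parameters
  • l1_obs_desc (str) – Observation descriptor, basis for level 1 partitioning (must be contained in keys of dataset.obs_descriptors)

  • l2_obs_desc (str) – Observation descriptor, basis for level 2 partitioning (must be contained in keys of dataset.obs_descriptors)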

Returns
  • odd_split (Dataset) – subset of the Dataset with odd list-indices after partitioning according to obs_desc

  • even_split (Dataset) – subset of the Dataset with even list-indices after partitioning according to obs_desc
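
The two-level partitioning can be pictured with plain NumPy row indices. The sketch below is not the rsatoolbox implementation: the helper names are hypothetical, and it assumes that "odd" means the 1st, 3rd, and so on, partition in order of first appearance.

```python
import numpy as np

def odd_even_rows(labels, rows):
    """Group `rows` by their label, then alternate the groups (hypothetical helper)."""
    values = list(dict.fromkeys(labels[rows].tolist()))  # distinct, first-appearance order
    groups = [rows[labels[rows] == v] for v in values]
    return groups[0::2], groups[1::2]  # assumption: "odd" = 1st, 3rd, ... group

def nested_odd_even_rows(l1_labels, l2_labels):
    """Partition by l1_labels, then odd-even split each partition by l2_labels."""
    l1_labels, l2_labels = np.asarray(l1_labels), np.asarray(l2_labels)
    # start with an empty index array so that np.concatenate never gets an empty list
    odd, even = [np.empty(0, dtype=int)], [np.empty(0, dtype=int)]
    for v in dict.fromkeys(l1_labels.tolist()):
        rows = np.flatnonzero(l1_labels == v)
        odd_groups, even_groups = odd_even_rows(l2_labels, rows)
        odd.extend(odd_groups)
        even.extend(even_groups)
    return np.concatenate(odd), np.concatenate(even)

session = [1, 1, 1, 1, 2, 2, 2, 2]   # level 1 descriptor
run = [1, 2, 3, 4, 1, 2, 3, 4]       # level 2 descriptor
odd_rows, even_rows = nested_odd_even_rows(session, run)
```

Because the split is made inside each level 1 partition, both halves receive observations from every session, which is the balancing property described above.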

odd_even_split(obs_desc)[source]

Perform a simple odd-even split on an rsatoolbox dataset. It will be partitioned into n different datasets, where n is the number of distinct values in dataset.obs_descriptors[obs_desc]. The resulting list will be split into odd and even (index) subsets. The datasets contained in these subsets will then be merged.

Parameters

obs_desc (str) – Observation descriptor, basis for partitioning (must be contained in keys of dataset.obs_descriptors)

Returns
  • odd_split (Dataset) – subset of the Dataset with odd list-indices after partitioning according to obs_desc

  • even_split (Dataset) – subset of the Dataset with even list-indices after partitioning according to obs_desc
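
As a rough illustration of the partition, alternate, and merge steps, the following NumPy-only sketch works directly on a measurements array. It is not the library's code, and it assumes that the "odd" subset collects the 1st, 3rd, and so on, partition.

```python
import numpy as np

measurements = np.arange(12, dtype=float).reshape(6, 2)  # 6 observations x 2 channels
runs = np.array([1, 1, 2, 2, 3, 3])                      # obs descriptor used for the split

# One partition per distinct descriptor value (n = 3 here).
values = list(dict.fromkeys(runs.tolist()))   # distinct values, first-appearance order
partitions = [measurements[runs == v] for v in values]

# Alternate the list of partitions, then merge each half again.
odd_split = np.concatenate(partitions[0::2])   # assumption: partitions for runs 1 and 3
even_split = np.concatenate(partitions[1::2])  # assumption: partition for run 2
```

With an odd number of distinct values the two halves differ in size, as they do here (4 versus 2 observations).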

sort_by(by)[source]

sorts the dataset by a given observation descriptor

Parameters

by (String) – the descriptor by which the dataset shall be sorted

Returns

split_channel(by)[source]

Returns a list of Datasets split by channels

Parameters

by (String) – the descriptor by which the splitting is made

Returns

list of Datasets, split by the selected channel_descriptor
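
Conceptually, a channel split selects columns of the measurements array, one group per distinct descriptor value. The sketch below shows that idea with NumPy only. The descriptor name `roi` and its values are made up for illustration, and the real method returns Dataset objects rather than bare arrays.

```python
import numpy as np

measurements = np.arange(8.0).reshape(2, 4)   # 2 observations x 4 channels
roi = np.array(['V1', 'V1', 'IT', 'IT'])      # hypothetical channel descriptor

# One column block per distinct value, in order of first appearance.
values = list(dict.fromkeys(roi.tolist()))
parts = [measurements[:, roi == v] for v in values]
```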

split_obs(by)[source]

Returns a list of Datasets split by obs

Parameters

by (String) – the descriptor by which the splitting is made

Returns

list of Datasets, split by the selected obs_descriptor

subset_channel(by, value)[source]

Returns a subsetted Dataset defined by a certain channel descriptor value

Parameters
  • by (String) – the descriptor by which the subset selection is made from channel dimension

  • value – the value by which the subset selection is made from channel dimension

Returns

Dataset, with subset defined by the selected channel_descriptor

subset_obs(by, value)[source]

Returns a subsetted Dataset defined by a certain obs descriptor value

Parameters
  • by (String) – the descriptor by which the subset selection is made from obs dimension

  • value – the value by which the subset selection is made from obs dimension

Returns

Dataset, with subset defined by the selected obs_descriptor
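
Selecting observations by descriptor value amounts to a boolean mask that is applied to the measurements and to every observation descriptor alike. A minimal NumPy sketch of that idea follows. The descriptor names and values are invented, and the library may accept more kinds of `value` than the single scalar used here.

```python
import numpy as np

measurements = np.arange(8.0).reshape(4, 2)  # 4 observations x 2 channels
obs_descriptors = {
    'conds': np.array(['face', 'house', 'face', 'house']),
    'run': np.array([1, 1, 2, 2]),
}

by, value = 'conds', 'face'
mask = obs_descriptors[by] == value          # True for the rows to keep
subset = measurements[mask]
# keep the descriptors aligned with the selected rows
subset_descriptors = {name: desc[mask] for name, desc in obs_descriptors.items()}
```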

class rsatoolbox.data.dataset.DatasetBase(measurements, descriptors=None, obs_descriptors=None, channel_descriptors=None, check_dims=True)[source]

Bases: object

Abstract dataset class. Defines members that every class needs to have, but does not implement any interesting behavior. Inherit from this class to define specific dataset types.

Parameters
  • measurements (numpy.ndarray) – n_obs x n_channel 2d-array

  • descriptors (dict) – descriptors (metadata)

  • obs_descriptors (dict) – observation descriptors (all are array-like with shape = (n_obs,…))

  • channel_descriptors (dict) – channel descriptors (all are array-like with shape = (n_channel,…))

Returns

dataset object
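
The shape requirements above can be checked by hand before constructing a dataset. The helper below is hypothetical and only illustrates the kind of consistency the shapes imply. It is not a description of what `check_dims` does internally.

```python
import numpy as np

def check_descriptor_lengths(measurements, obs_descriptors, channel_descriptors):
    """Hypothetical helper: raise if a descriptor does not match the array shape."""
    n_obs, n_channel = measurements.shape
    for name, values in obs_descriptors.items():
        if len(values) != n_obs:
            raise ValueError(f"obs descriptor {name!r} needs {n_obs} entries")
    for name, values in channel_descriptors.items():
        if len(values) != n_channel:
            raise ValueError(f"channel descriptor {name!r} needs {n_channel} entries")

measurements = np.zeros((5, 3))                      # n_obs = 5, n_channel = 3
obs_descriptors = {'conds': np.arange(5)}            # one entry per observation
channel_descriptors = {'roi': ['V1', 'V2', 'V3']}    # one entry per channel
check_descriptor_lengths(measurements, obs_descriptors, channel_descriptors)
```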

static from_df(df: DataFrame, channels: Optional[List] = None, channel_descriptor: Optional[str] = None) → Dataset[source]

Create a Dataset from a Pandas DataFrame

Float columns are interpreted as channels, and their names are stored as a channel descriptor “name”. Columns of any other datatype will be interpreted as observation descriptors, unless they have the same value throughout, in which case they will be interpreted as Dataset descriptors.

Parameters
  • df (DataFrame) – a long-format DataFrame

  • channels (list) – list of column names to interpret as channels. By default all float columns are considered channels.

  • channel_descriptor (str) – Name of the channel descriptor to create on the Dataset which contains the column names. Default is “name”.

Returns

RSAtoolbox Dataset representing the data from the DataFrame

Return type

Dataset
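
The column rules described above can be reproduced with pandas alone. The following is a sketch of those rules, not the actual body of from_df, and the column names are invented for the example.

```python
import pandas as pd
from pandas.api.types import is_float_dtype

df = pd.DataFrame({
    'vox1': [0.1, 0.2, 0.3],
    'vox2': [1.0, 1.1, 1.2],
    'cond': ['a', 'b', 'c'],
    'subject': ['s01', 's01', 's01'],   # same value throughout
})

# Float columns become channels.
channels = [c for c in df.columns if is_float_dtype(df[c])]
others = [c for c in df.columns if c not in channels]

# Constant non-float columns describe the whole dataset, the rest describe observations.
dataset_descriptors = {c: df[c].iloc[0] for c in others if df[c].nunique() == 1}
obs_descriptors = {c: df[c].tolist() for c in others if c not in dataset_descriptors}

measurements = df[channels].to_numpy()      # n_obs x n_channel
channel_descriptors = {'name': channels}    # default descriptor name per the text above
```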

save(filename, file_type='hdf5', overwrite=False)[source]

Saves the dataset object to a file

Parameters
  • filename (String) – path to the file [or opened file]

  • file_type (String) – Type of file to create: ‘hdf5’ for an hdf5 file, ‘pkl’ for a pickle file

  • overwrite (Boolean) – overwrites file if it already exists

split_channel(by)[source]

Returns a list of Datasets split by channels

Parameters

by (String) – the descriptor by which the splitting is made

Returns

list of Datasets, split by the selected channel_descriptor

split_obs(by)[source]

Returns a list of Datasets split by obs

Parameters

by (String) – the descriptor by which the splitting is made

Returns

list of Datasets, split by the selected obs_descriptor

subset_channel(by, value)[source]

Returns a subsetted Dataset defined by a certain channel descriptor value

Parameters
  • by (String) – the descriptor by which the subset selection is made from channel dimension

  • value – the value by which the subset selection is made from channel dimension

Returns

Dataset, with subset defined by the selected channel_descriptor

subset_obs(by, value)[source]

Returns a subsetted Dataset defined by a certain obs descriptor value

Parameters
  • by (String) – the descriptor by which the subset selection is made from obs dimension

  • value – the value by which the subset selection is made from obs dimension

Returns

Dataset, with subset defined by the selected obs_descriptor

to_df(channel_descriptor: Optional[str] = None) → DataFrame[source]

returns a Pandas DataFrame representing this Dataset

Channels, observation descriptors and Dataset descriptors make up the columns. Rows represent observations.

Note that channel descriptors beyond the one used for the column names will not be represented.

Parameters

channel_descriptor – Which channel descriptor to use to label the data columns in the Dataframe. Defaults to the first channel descriptor.

Returns

A pandas DataFrame representing the Dataset

Return type

DataFrame

to_dict()[source]

Generates a dictionary which contains the information to recreate the dataset object. Used for saving to disk.

Returns

dictionary with dataset information

Return type

data_dict(dict)

class rsatoolbox.data.dataset.TemporalDataset(measurements, descriptors=None, obs_descriptors=None, channel_descriptors=None, time_descriptors=None, check_dims=True)[source]

Bases: Dataset

TemporalDataset for spatio-temporal datasets

Parameters
  • measurements (numpy.ndarray) – n_obs x n_channel x time 3d-array

  • descriptors (dict) – descriptors (metadata)

  • obs_descriptors (dict) – observation descriptors (all are array-like with shape = (n_obs,…))

  • channel_descriptors (dict) – channel descriptors (all are array-like with shape = (n_channel,…))

  • time_descriptors (dict) –

    time descriptors (all are array-like with shape = (n_time,…))

    time_descriptors needs to contain one key ‘time’ that specifies the time-coordinate. If None is provided, ‘time’ is set as (0, 1, …, n_time-1)

Returns

dataset object
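
The default for the ‘time’ descriptor described above can be written out in a few lines. This is a sketch of the stated rule, not the constructor's actual code.

```python
import numpy as np

measurements = np.zeros((4, 3, 10))  # n_obs x n_channel x n_time
time_descriptors = None              # nothing supplied by the caller

n_time = measurements.shape[2]
if time_descriptors is None:
    time_descriptors = {'time': np.arange(n_time)}  # (0, 1, ..., n_time-1)

# every time descriptor is expected to have one entry per time point
assert all(len(values) == n_time for values in time_descriptors.values())
```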

bin_time(by, bins)[source]

Returns a TemporalDataset object with time-binned data.

Parameters
  • by (String) – the time descriptor by which the binning is made

  • bins (array-like) – list of bins, with bins[i] containing the vector of time-points for the i-th bin

Returns

a single TemporalDataset object

Data is averaged within time-bins. ‘time’ descriptor is set to the average of the binned time-points.
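
A minimal NumPy sketch of the averaging described above, assuming the bins list the exact time-point values that belong to each bin. It operates on a bare array rather than a TemporalDataset and is not the library's implementation.

```python
import numpy as np

# 2 observations x 1 channel x 6 time points
measurements = np.arange(12, dtype=float).reshape(2, 1, 6)
time = np.array([0, 10, 20, 30, 40, 50])
bins = [[0, 10], [20, 30], [40, 50]]   # bins[i] holds the time points of the i-th bin

# Average the measurements over the time points of each bin.
binned = np.stack(
    [measurements[:, :, np.isin(time, b)].mean(axis=2) for b in bins],
    axis=2,
)
# The new 'time' value of a bin is the mean of its time points.
binned_time = np.array([np.mean(b) for b in bins])
```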

convert_to_dataset(by)[source]

Converts to Dataset long format. The time dimension is absorbed into the observation dimension.

Parameters

by (String) – the descriptor which indicates the time dimension in the time_descriptor

Returns

Dataset
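
Absorbing the time dimension into the observation dimension can be sketched as a transpose followed by a reshape. The row order used here (all observations for the first time point, then all for the second, and so on) is an assumption made for illustration. The library's ordering may differ.

```python
import numpy as np

measurements = np.arange(24, dtype=float).reshape(2, 3, 4)  # n_obs x n_channel x n_time
time = np.array([0, 1, 2, 3])
n_obs, n_channel, n_time = measurements.shape

# Move time to the front, then stack the n_time blocks of n_obs rows each.
long_measurements = measurements.transpose(2, 0, 1).reshape(n_time * n_obs, n_channel)

# Each row of the long format needs to know its time point and original observation.
time_column = np.repeat(time, n_obs)
obs_index = np.tile(np.arange(n_obs), n_time)
```

In the long format the time values become an ordinary observation descriptor, alongside the repeated original observation descriptors.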

sort_by(by)[source]

sorts the dataset by a given observation descriptor

Parameters

by (String) – the descriptor by which the dataset shall be sorted

Returns

split_channel(by)[source]

Returns a list of TemporalDatasets split by channels

Parameters

by (String) – the descriptor by which the splitting is made

Returns

list of TemporalDatasets, split by the selected channel_descriptor

split_obs(by)[source]

Returns a list of TemporalDatasets split by obs

Parameters

by (String) – the descriptor by which the splitting is made

Returns

list of TemporalDatasets, split by the selected obs_descriptor

split_time(by)[source]

Returns a list of TemporalDatasets split by time

Parameters

by (String) – the descriptor by which the splitting is made

Returns

list of TemporalDatasets, split by the selected time_descriptor

subset_channel(by, value)[source]

Returns a subsetted TemporalDataset defined by a certain channel descriptor value

Parameters
  • by (String) – the descriptor by which the subset selection is made from channel dimension

  • value – the value by which the subset selection is made from channel dimension

Returns

TemporalDataset, with subset defined by the selected channel_descriptor

subset_obs(by, value)[source]

Returns a subsetted TemporalDataset defined by a certain obs descriptor value

Parameters
  • by (String) – the descriptor by which the subset selection is made from obs dimension

  • value – the value by which the subset selection is made from obs dimension

Returns

TemporalDataset, with subset defined by the selected obs_descriptor

subset_time(by, t_from, t_to)[source]

Returns a subsetted TemporalDataset with time between t_from and t_to

Parameters
  • by (String) – the descriptor by which the subset selection is made from time dimension

  • t_from – time-point from which onwards data should be subsetted

  • t_to – time-point until which data should be subsetted

Returns

TemporalDataset, with subset defined by the selected time_descriptor
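
Selecting a time window comes down to a mask over the last axis of the measurements. The sketch below treats both t_from and t_to as inclusive, which is an assumption. The text above does not state how the boundaries are handled.

```python
import numpy as np

measurements = np.zeros((2, 3, 5))              # n_obs x n_channel x n_time
time = np.array([-100, 0, 100, 200, 300])       # values of the chosen time descriptor
t_from, t_to = 0, 200

mask = (time >= t_from) & (time <= t_to)        # assumption: both ends inclusive
subset = measurements[:, :, mask]
subset_time = time[mask]
```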

to_dict()[source]

Generates a dictionary which contains the information to recreate the TemporalDataset object. Used for saving to disk.

Returns

dictionary with TemporalDataset information

Return type

data_dict(dict)

rsatoolbox.data.dataset.dataset_from_dict(data_dict)[source]

Regenerates a Dataset object from the dictionary representation

Currently this function works for Dataset, DatasetBase, and TemporalDataset objects

Parameters

data_dict (dict) – the dictionary representation

Returns

the regenerated Dataset

Return type

data(Dataset)

rsatoolbox.data.dataset.load_dataset(filename, file_type=None)[source]

Loads a Dataset object from disk

Parameters

filename (String) – path to file to load

rsatoolbox.data.dataset.merge_subsets(dataset_list)[source]

Generate a dataset object from a list of smaller dataset objects (e.g., as generated by the subset_* methods). Assumes that descriptors, channel descriptors and number of channels per observation match.

Parameters

dataset_list (list) – List containing rsatoolbox datasets

Returns

rsatoolbox dataset created from all datasets in dataset_list

Return type

merged_dataset (Dataset)
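
Merging along the observation dimension can be thought of as concatenating the measurements and each observation descriptor in the same order. The helper below is a hypothetical stand-in that works on plain arrays and dictionaries. It is not the rsatoolbox function.

```python
import numpy as np

def merge_rows(parts):
    """Hypothetical helper: parts is a list of (measurements, obs_descriptors) pairs."""
    if len({m.shape[1] for m, _ in parts}) != 1:
        raise ValueError("all parts need the same number of channels")
    merged = np.concatenate([m for m, _ in parts], axis=0)
    merged_descriptors = {
        name: np.concatenate([d[name] for _, d in parts])
        for name in parts[0][1]
    }
    return merged, merged_descriptors

part_a = (np.zeros((2, 3)), {'conds': np.array([0, 1])})
part_b = (np.ones((1, 3)), {'conds': np.array([2])})
merged, merged_descriptors = merge_rows([part_a, part_b])
```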