Motivation
In the Machine Learning (ML) field we need benchmarks to evaluate the behaviour and performance of our systems. The test suite is chosen or designed with many characteristics in mind:
- Datasets with diverse numbers of attributes
- Low/High number of classes
- Low/High class imbalance
- Etc.
What if we could have an arbitrarily adjustable family of real datasets? We could avoid having to artificially inflate the datasets and, therefore, the evaluation procedure could (potentially) be fairer and more reliable. This is what the datasets in this repository provide.
What can Protein Structure Prediction do for ML benchmarking?
In short, Protein Structure Prediction (PSP) aims to predict the 3D structure of a protein from its primary sequence, that is, a chain of amino acids (a string over a 20-letter alphabet).
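Structural features of each residue are predicted from the local context (a window of amino acids) of the target residue in the chain. For example, a feature called Coordination Number (CN) for residue i can be predicted using information about the residue itself and its two nearest neighbours in the chain sequence, i-1 and i+1. Because the window size can be chosen freely, we can construct a family of datasets with an arbitrarily increasing number of attributes.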
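The windowing idea can be sketched in a few lines of Python. This is only an illustration: the function name `window_features`, the padding symbol `X` for positions beyond the ends of the chain, and the example sequence are assumptions, not something taken from the repository.

```python
def window_features(sequence, i, half_width, pad="X"):
    """Return the residues at positions i-half_width .. i+half_width.

    Positions that fall outside the chain are filled with `pad`
    (a hypothetical end-of-chain symbol, assumed for this sketch).
    """
    return [
        sequence[j] if 0 <= j < len(sequence) else pad
        for j in range(i - half_width, i + half_width + 1)
    ]


sequence = "MKTAYIAK"  # made-up chain, for illustration only

# Residue i = 3 with its two nearest neighbours (window of ±1).
print(window_features(sequence, 3, 1))  # ['T', 'A', 'Y']

# At the start of the chain the left neighbour does not exist.
print(window_features(sequence, 0, 1))  # ['X', 'M', 'K']
```

A window of ±w always yields 2w+1 positions, so widening the window is what increases the number of input attributes.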

Moreover, most of these structural features are defined as continuous variables, so they can be treated as regression problems. However, it is also common to discretize them and treat them as classification problems. Because we can choose the number of bins into which a feature is discretized, we can also construct a family of datasets with an arbitrarily increasing number of classes. The discretization criterion determines the class distribution: uniform-frequency (UF) discretization produces datasets with well-balanced classes, whereas uniform-length (UL) discretization produces datasets with an uneven class distribution.
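The difference between the two discretization criteria is easy to see on a small skewed sample. The sketch below is an assumption about how UF and UL cut points can be computed (quantile positions versus equal-width intervals); the helper names and the toy values are hypothetical and are not real CN data or the repository's own code.

```python
from bisect import bisect_right


def uniform_length_cuts(values, n_bins):
    """Cut points that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(1, n_bins)]


def uniform_frequency_cuts(values, n_bins):
    """Cut points taken at evenly spaced ranks of the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def discretize(values, cuts):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(cuts, v) for v in values]


def class_counts(labels, n_bins):
    """Number of instances assigned to each class."""
    return [labels.count(c) for c in range(n_bins)]


# Toy, skewed values standing in for a continuous structural feature.
values = [1, 2, 3, 4, 5, 6, 8, 10, 15, 20, 40, 100]

uf = discretize(values, uniform_frequency_cuts(values, 3))
ul = discretize(values, uniform_length_cuts(values, 3))
print(class_counts(uf, 3))  # [4, 4, 4]
print(class_counts(ul, 3))  # [10, 1, 1]
```

On the same values, UF gives three equally populated classes while UL puts almost everything in the first class, which is the balanced versus imbalanced contrast described above.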
Finally, there are usually two types of basic input information that can be used in these datasets:
- Directly using the amino acids (AA) of the primary sequence, where each amino acid is defined as one nominal variable with 20 possible values.
- Using a Position-Specific Scoring Matrix (PSSM) representation derived from the primary sequence. The PSSM representation is a statistical profile of the primary sequence that takes into account how this sequence may have evolved. Each amino acid is defined as 20 continuous variables.
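As a rough guide to how the two representations scale with the window, the sketch below counts input attributes under the assumption that a window of ±w covers 2w+1 residues and that no attributes other than the per-residue ones are present. The function `n_attributes` is hypothetical, and the actual files in the repository may differ.

```python
def n_attributes(half_width, representation):
    """Input attributes for a window of ±half_width residues.

    Assumes 1 nominal variable per residue for AA and 20 continuous
    variables per residue for PSSM, with no additional attributes.
    """
    per_residue = {"AA": 1, "PSSM": 20}[representation]
    return (2 * half_width + 1) * per_residue


for w in (0, 1, 9):
    print(w, n_attributes(w, "AA"), n_attributes(w, "PSSM"))
# 0 1 20
# 1 3 60
# 9 19 380
```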
Construction of the repository
- We have taken one PSP feature, Coordination Number (please, click here for more details)
- Using a well-known set of 1050 proteins and ~260,000 amino acids (instances), partitioned into 10-fold cross-validation training/test sets
- Generated versions of the dataset with a window size ranging from 0 to ±9 amino acids
- Either using AA or PSSM as input information
- For the classification versions of the dataset using 2, 3 or 5 classes
- Using either the UF or UL discretization methods for the class definition
- Generating also a regression version of the datasets
- Total: 140 versions of the datasets, taking around 100 GB of disk space uncompressed
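The total of 140 is consistent with combining the options listed above: 10 window sizes, 2 input representations, and 7 targets (3 class counts under each of the 2 discretization methods, plus 1 regression version). The enumeration below is a sketch of that reading; the tuple layout and labels are hypothetical and do not reflect the repository's file naming.

```python
from itertools import product

window_half_widths = range(0, 10)   # 0 to ±9 amino acids
inputs = ["AA", "PSSM"]

# One regression target plus every (number of classes, discretization) pair.
targets = [("regression", None, None)] + [
    ("classification", n_classes, method)
    for n_classes in (2, 3, 5)
    for method in ("UF", "UL")
]

versions = list(product(window_half_widths, inputs, targets))
print(len(versions))  # 140
```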