Motivation

In the Machine Learning (ML) field we need benchmarks to evaluate the behaviour and performance of our systems. A test suite is chosen or designed with many characteristics in mind. Sometimes we use real datasets (e.g. the UCI repository datasets), and sometimes synthetic ones (e.g. learning boolean functions such as the 6/11/20/37/70-bit Multiplexer). Real-world datasets usually contain noise and inconsistencies, which makes them useful for evaluating the robustness of a learning system. However, if our objective is to evaluate scalability, that is, how many attributes, classes, etc. the system is able to cope with, then we need a really broad range of datasets. In that case we can use synthetic datasets whose dimensions we can adjust arbitrarily, or we can inflate real datasets with irrelevant data. In either case, the evaluation procedure always introduces some degree of bias.

What if we could have an arbitrarily adjustable family of real datasets? We would not have to inflate the datasets artificially and, therefore, the evaluation procedure could (potentially) be fairer and more reliable. This is what the datasets in this repository provide.

What can Protein Structure Prediction do for ML benchmarking?

In short, Protein Structure Prediction (PSP) aims to predict the 3D structure of a protein based on its primary sequence, that is, a chain of amino acids (a string using a 20-letter alphabet).
PSP is, overall, an optimization problem. However, each amino acid can be characterized by several structural features, and a good prediction of these features contributes greatly to obtaining better models for the 3D PSP problem. These features can be predicted as classification/regression problems.

These features are predicted from the local context of the target residue in the chain, that is, a window of amino acids around it. For example, a feature called Coordination Number (CN) can be predicted for residue i using information about the residue itself and its two nearest neighbours in the chain sequence, i-1 and i+1.
By generating versions of this problem with different window sizes, we can construct a family of datasets with an arbitrarily increasing number of attributes, which is useful for evaluating how a learning system copes with datasets of different sizes.
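The windowing idea can be sketched in a few lines of Python. This is only an illustration, not the procedure used to build the repository: the function name `window_instances`, the toy sequence, and the padding symbol `X` used at the ends of the chain are all assumptions made for this example.

```python
def window_instances(sequence, half_width, pad="X"):
    """Build one instance per residue from its sequence neighbourhood.

    Each instance holds the residue at position i plus `half_width`
    neighbours on each side, so it has 2 * half_width + 1 attributes.
    Positions beyond the chain ends are filled with `pad` (an assumed
    convention for this sketch).
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [list(padded[i:i + size]) for i in range(len(sequence))]


toy_sequence = "MKVLA"  # hypothetical 5-residue chain

# Widening the window yields the same instances with more attributes.
for half_width in (1, 2, 3):
    rows = window_instances(toy_sequence, half_width)
    print(half_width, len(rows), len(rows[0]), rows[0])
```

The number of instances stays fixed at one per residue, while the number of attributes grows as 2 * half_width + 1, which is what makes the family of datasets adjustable along the attribute dimension.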

Moreover, most of these structural features are defined as continuous variables, so they can be treated as regression problems. However, it is also usual to discretize them and treat them as classification problems. Since we decide the number of bins into which the feature is discretized, we can also construct a family of datasets with an arbitrarily increasing number of classes. The discretization criterion determines the class distribution: a uniform-frequency (UF) discretization creates datasets with a well-balanced class distribution, whereas a uniform-length (UL) discretization creates datasets with an uneven class distribution.
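The difference between the two discretization criteria can be illustrated with a minimal sketch. The feature values below are invented, and the way the cut points are computed (equal-width intervals for UL, sample quantiles for UF) is one common reading of these terms, not necessarily the exact procedure used for the repository.

```python
from bisect import bisect_right
from collections import Counter


def uniform_length_cuts(values, n_bins):
    """UL: cut points split the range [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * k for k in range(1, n_bins)]


def uniform_frequency_cuts(values, n_bins):
    """UF: cut points are sample quantiles, so bins hold similar counts."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]


def discretize(values, cuts):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(cuts, v) for v in values]


# Hypothetical, skewed feature values (not real CN measurements).
feature = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 9.0, 20.0]

ul_classes = discretize(feature, uniform_length_cuts(feature, 2))
uf_classes = discretize(feature, uniform_frequency_cuts(feature, 2))

print("UL class counts:", sorted(Counter(ul_classes).items()))
print("UF class counts:", sorted(Counter(uf_classes).items()))
```

On this skewed sample, the UL cut point falls at 10.5 and leaves nine values in one class and one in the other, while the UF cut point falls at the median and splits the values five and five. Increasing `n_bins` gives the same feature with more classes.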

Finally, there are usually two types of basic input information that can be used in these datasets:

In summary, PSP can provide us with a large variety of ML datasets, derived from predicting the same protein structural feature with different formulations of inputs and outputs. Thus, we have an adjustable real-world family of benchmarks suitable for testing the scalability of prediction methods on several fronts.

Construction of the repository