Coordination number prediction using Learning Classifier Systems: Performance and interpretability
J. Bacardit, M. Stout, J.D. Hirst, N. Krasnogor and J. Blazewicz
In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation (GECCO2006), pp. 247-254, ACM Press, 2006
gecco2006-cn.pdf
Fast Rule Representation for Continuous Attributes in Genetics-Based Machine Learning
Bacardit, J. and Krasnogor, N.
In Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation (GECCO2008), to appear, ACM Press, 2008
gecco2008.pdf
Here is a brief summary:
- The dataset is derived from a set of 1050 protein chains selected using the PDB-REPRDB server
- Following Kinjo et al., the list of proteins was scrambled 10 times. For each scrambled list, the first 950 proteins were used for training and the remaining 100 for testing.
- For each protein, its PDB file, specifying its 3D structure, was downloaded from the Protein Data Bank repository.
- The primary sequence of the protein and the Coordination Number (CN) value of each amino acid were extracted or computed from the PDB file.
- The PSSM profile of each protein was computed from its primary sequence by running the PSI-BLAST program against the NR database.
- Given a window size and a representation (AA/PSSM), the input data of the training/test sets is constructed using in-house Perl scripts.
- For the regression version of the datasets, the output is the CN value computed previously, used directly.
- For the classification datasets, a number of bins and a discretization criterion (balanced/non-balanced classes) are needed. The bins are computed separately for each training set using all of its instances and then applied to the corresponding test set.
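
The splitting, windowing and discretization steps above can be sketched in Python as follows. This is only an illustration, not the in-house Perl scripts: all function names, the random seed, the centred window and the "X" padding symbol are hypothetical. It also assumes that "balanced" means equal-frequency bins and "non-balanced" means equal-width bins, which the summary does not state explicitly.

```python
import bisect
import random


def scrambled_splits(chains, n_splits=10, n_train=950, seed=0):
    """Yield (train, test) pairs: scramble the list, keep the first n_train for training."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility only
    for _ in range(n_splits):
        shuffled = list(chains)
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


def sliding_windows(sequence, half_width, pad="X"):
    """Return one window of 2*half_width+1 residues per position (AA representation).

    Centring the window on the target residue and padding the ends with `pad`
    are assumptions made for this sketch.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [padded[i:i + size] for i in range(len(sequence))]


def cut_points(train_values, n_bins, balanced=True):
    """Compute n_bins - 1 cut points from the training CN values only."""
    values = sorted(train_values)
    if balanced:
        # Assumed meaning of "balanced": roughly the same number of instances per bin.
        return [values[(len(values) * k) // n_bins] for k in range(1, n_bins)]
    # Assumed meaning of "non-balanced": bins of equal width over the CN range.
    lo, hi = values[0], values[-1]
    return [lo + (hi - lo) * k / n_bins for k in range(1, n_bins)]


def to_class(value, cuts):
    """Map a CN value to its bin index using previously computed cut points."""
    return bisect.bisect_right(cuts, value)
```

The cut points are computed once per training set and then reused unchanged through `to_class` for the matching test set, so no information from the test proteins leaks into the class boundaries.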