Genomic strings are not of fixed length; they provide one-dimensional spatial data that do not divide, for conquering by machine
learning, into manageable fixed-size chunks obeying Dietterich's independent and identically distributed assumption. We nonetheless
need to divide genomic strings for conquering by machine learning, in this case for genomic prediction. Orthologs are genomic
strings derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting
since it informs us about protein divergence through evolution and, in the present context, also has important agricultural
applications. The present paper indicates means to obtain an associated (fixed-size) attribute vector for genomic string
data, and to divide and conquer the machine learning problem of ortholog detection, herein seen as an analogy problem. The
attributes are based both on the typical string similarity measures of bioinformatics and on a large number of differential
metrics, many new to bioinformatics. Many of the differential metrics are based on evolutionary considerations, both theoretical
and empirically observed, in some cases observed by the authors. C5.0 with AdaBoosting activated was employed, and the preliminary
results reported herein regarding complete cDNA strings are very encouraging for eventually and usefully employing the techniques
described for ortholog detection on the more readily available EST (incomplete) genomic data.
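As a minimal sketch of the idea of mapping two variable-length strings to one fixed-size attribute vector, the following Python fragment computes four generic similarity attributes for a pair of sequences. The function names and the particular attributes (normalized edit distance, k-mer overlap, length difference, GC-content difference) are illustrative assumptions only; they are not the attribute set or the differential metrics used in the paper.

```python
from collections import Counter
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def kmer_counts(s: str, k: int) -> Counter:
    """Multiset of all length-k substrings of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def gc_fraction(s: str) -> float:
    """Fraction of G and C characters; 0.0 for the empty string."""
    return sum(c in "GC" for c in s) / len(s) if s else 0.0


def attribute_vector(a: str, b: str, k: int = 3) -> List[float]:
    """Hypothetical fixed-size (length 4) attribute vector for a string pair."""
    ka, kb = kmer_counts(a, k), kmer_counts(b, k)
    shared = sum((ka & kb).values())   # multiset intersection size
    total = sum((ka | kb).values())    # multiset union size
    longest = max(len(a), len(b), 1)
    return [
        edit_distance(a, b) / longest,          # normalized edit distance
        shared / total if total else 0.0,       # k-mer overlap (Jaccard-style)
        abs(len(a) - len(b)) / longest,         # relative length difference
        abs(gc_fraction(a) - gc_fraction(b)),   # GC-content difference
    ]
```

Whatever the lengths of the two input strings, the result always has four entries, which is what lets a learner such as a decision tree treat each string pair as one example.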
Machine learning [Mit97, RN95] involves algorithmic techniques for fitting programs to data and for outputting the programs
fit for subsequent use in predicting future data. A program so fit to data is said to be learned.
Amino acid sequences fold into 3-D structures, but that, for us, will be taken into account in future work. See Section 6 below.
IL-2 is interleukin 2, an immune system protein.
Exons contain the coding portions of genes.
Applying attribute values for both chicken-mouse and chicken-human comparisons improves performance over just employing comparisons
between chicken and one of these mammals.
Importantly, the voting weights are larger for the more accurate trees in the sequence of trees.
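To illustrate how a voting weight can grow with accuracy, the sketch below uses the rule from Freund and Schapire's AdaBoost.M1, in which a classifier with weighted training error e receives vote weight ln((1 - e) / e). This is an assumption made for illustration: the source does not state the exact weighting formula used by C5.0, and the function name is hypothetical.

```python
import math


def voting_weight(error_rate: float) -> float:
    """AdaBoost.M1-style vote weight for a tree with the given weighted error.

    Assumes 0 < error_rate < 0.5, i.e. the tree is imperfect but better than
    chance; lower error gives a larger weight.
    """
    if not 0.0 < error_rate < 0.5:
        raise ValueError("error_rate must lie strictly between 0 and 0.5")
    return math.log((1.0 - error_rate) / error_rate)
```

Under this rule a tree with error 0.1 outweighs a tree with error 0.3, and the weight shrinks toward zero as the error approaches 0.5.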
In the present project we are working only with exons or portions thereof.
Recall from Section 4 above that the ensemble of trees obtained from AdaBoosting makes its decisions by a judiciously weighted
majority vote among the decisions of its constituent trees: even more usefully subtle decision making than that of any single
tree.
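The weighted majority vote can be sketched in a few lines of Python. The function name, the class labels, and the numeric weights below are hypothetical; the sketch only shows the voting step and takes the per-tree weights as given.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def weighted_vote(predictions: Sequence[Hashable],
                  weights: Sequence[float]) -> Hashable:
    """Return the label whose supporting trees have the largest total weight."""
    if not predictions or len(predictions) != len(weights):
        raise ValueError("need one weight per prediction, and at least one")
    totals: Dict[Hashable, float] = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight  # each tree adds its weight to its own label
    return max(totals, key=totals.get)


# Three trees: one accurate (heavily weighted) tree against two weaker ones.
decision = weighted_vote(
    ["ortholog", "non-ortholog", "non-ortholog"],
    [2.0, 0.6, 0.5],
)
```

Here the single heavily weighted tree (total 2.0) prevails over the two weaker trees (total 1.1), whereas an unweighted majority would have gone the other way.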