Lecture Notes in Computer Science, 2005, Volume 3776/2005, 605-610, DOI: 10.1007/11590316_96

A Novel Algorithm for Automatic Species Identification Using Principal Component Analysis

Shreyas Sen, Seetharam Narasimhan and Amit Konar

Abstract

This paper describes a novel scheme for automatic identification of a species from its genomic data. Random samples of a fixed length (10,000 elements) are taken from the genome sequence of a particular species. A set of 64 keywords is generated from all possible 3-tuple combinations of the 4 letters A (Adenine), T (Thymine), C (Cytosine) and G (Guanine), which represent the four types of nucleotide bases in a DNA strand. These 4^3 = 64 keywords are searched for in a sample of the genome sequence and their frequencies of occurrence are determined. Repeating this process for N randomly selected samples from the genome sequence yields an N × 64 matrix of frequency counts. Principal Component Analysis is then applied to this matrix to obtain a Feature Descriptor of reduced dimension (1 × 64). Comparing the feature descriptors of different species, and of different samples from the same species, shows that a descriptor is unique to a particular species, while wide differences exist between the descriptors of different species. Because the variance of the descriptors for a given genome sequence is negligible, the proposed scheme is well suited to automatic species identification.
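
The pipeline described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions the abstract leaves open: keywords are counted with overlapping sliding windows, the 1 × 64 descriptor is taken to be the leading principal component of the N × 64 matrix, and that component is found by power iteration. The function names (keyword_counts, frequency_matrix, leading_component) and the synthetic random genome are hypothetical stand-ins.

```python
import itertools
import random

BASES = "ATCG"
# All 4^3 = 64 three-letter keywords, in a fixed order.
KEYWORDS = ["".join(p) for p in itertools.product(BASES, repeat=3)]
INDEX = {kw: i for i, kw in enumerate(KEYWORDS)}


def keyword_counts(sample):
    """Count each keyword in a sample (overlapping windows: an assumption)."""
    counts = [0] * len(KEYWORDS)
    for i in range(len(sample) - 2):
        j = INDEX.get(sample[i:i + 3])
        if j is not None:  # skip windows containing symbols other than A, T, C, G
            counts[j] += 1
    return counts


def frequency_matrix(genome, n_samples, sample_len, rng):
    """Build the N x 64 count matrix from N randomly positioned samples."""
    rows = []
    for _ in range(n_samples):
        start = rng.randrange(len(genome) - sample_len + 1)
        rows.append(keyword_counts(genome[start:start + sample_len]))
    return rows


def leading_component(rows, rng, iterations=200):
    """Leading principal component of the rows (assumed descriptor)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    # Random start: a uniform vector would lie in the null space, because
    # every row of counts sums to the same total.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iterations):
        # w = X^T (X v), proportional to the covariance matrix times v.
        proj = [sum(x * y for x, y in zip(row, v)) for row in centered]
        w = [sum(centered[i][k] * proj[i] for i in range(n)) for k in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # all samples identical: no variance to analyse
            break
        v = [x / norm for x in w]
    # An eigenvector's sign is arbitrary; make the largest entry positive.
    if max(v, key=abs) < 0:
        v = [-x for x in v]
    return v


rng = random.Random(0)
# Synthetic stand-in for a real genome sequence.
genome = "".join(rng.choice(BASES) for _ in range(50_000))
matrix = frequency_matrix(genome, n_samples=20, sample_len=10_000, rng=rng)
descriptor = leading_component(matrix, rng)
```

Here matrix plays the role of the N × 64 frequency data with N = 20, and descriptor is a unit-length vector of 64 values. Comparing descriptors across species would need real genome sequences and a distance measure, neither of which the abstract specifies.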
