The Gemina system (http://gemina.tigr.org), developed at TIGR, is a tool for identifying microbial and viral pathogens
and their genomic sequences from the associated epidemiological data. Gemina is designed to
identify epidemiological factors of disease incidence and to support the design of DNA-based diagnostics, such as
DNA signature-based assays. The Gemina database contains the full complement of microbial and viral pathogens enumerated
in the Microbial Rosetta Stone database (MRS) [1]. Initial curation efforts in Gemina have focused on the NIAID category
A, B, and C priority pathogens [2], identified to the strain level. For the bacterial NIAID category A-C pathogens, for
example, Gemina includes 38 species and 769 strains. Representative genomic sequences are selected for each pathogen
from NCBI’s GenBank by a three-tiered filtering system and incorporated into TIGR’s Panda DNA sequence database. A single
representative sequence is selected for each pathogen, in order of preference, from complete genome sequences (Tier 1), from whole
genome shotgun (WGS) data from genome projects (Tier 2), or from genomic nucleotide sequences from genome projects
(Tier 3). The list of selected accessions is transferred to Insignia when new pathogens are added to Gemina, allowing Insignia’s
Signature Pipeline [3] to be run for each pathogen identified in a Gemina query.
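
The three-tiered selection described above can be illustrated with a minimal sketch. This is not Gemina's implementation: the `Candidate` record, the category labels in `TIER_ORDER`, and both function names are hypothetical, chosen only to mirror the tiers named in the text. The text does not say how ties within a tier are resolved, so the sketch simply assumes the first candidate encountered is kept.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

# Hypothetical category labels, listed in order of preference (Tier 1, 2, 3).
TIER_ORDER = ("complete_genome", "wgs", "genomic_nucleotide")


@dataclass(frozen=True)
class Candidate:
    """One candidate sequence record for a pathogen (hypothetical fields)."""
    accession: str  # sequence accession; placeholder values only
    pathogen: str   # pathogen name, identified to the strain level
    category: str   # one of TIER_ORDER


def tier_of(candidate: Candidate) -> int:
    """Return the 1-based tier number of a candidate."""
    return TIER_ORDER.index(candidate.category) + 1


def select_representative(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Pick one candidate from the best (lowest-numbered) tier available.

    Candidates with an unknown category are ignored. Ties within a tier are
    resolved by input order, which is an assumption of this sketch.
    """
    ranked = [c for c in candidates if c.category in TIER_ORDER]
    if not ranked:
        return None
    return min(ranked, key=tier_of)  # min() keeps the first of equal keys


def select_all(candidates: Iterable[Candidate]) -> Dict[str, Candidate]:
    """Group candidates by pathogen and select one representative for each."""
    by_pathogen: Dict[str, List[Candidate]] = {}
    for c in candidates:
        by_pathogen.setdefault(c.pathogen, []).append(c)
    selected: Dict[str, Candidate] = {}
    for pathogen, group in by_pathogen.items():
        best = select_representative(group)
        if best is not None:
            selected[pathogen] = best
    return selected
```

Calling `select_all` on a mixed list of candidates returns one record per pathogen. The accessions of those records correspond to the list that the text describes as being transferred to Insignia.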