Sequence Databases

Publicly available sequence databases have been developed to help store, index and retrieve the thousands of DNA, RNA and protein sequences studied by basic researchers and identified by the vast genomic sequencing efforts.

The common characteristic of sequence databases is that they contain records of individual sequences. Databases can differ in the types of sequences they store. Sequences may be strings of nucleic acid or amino acid residues and include, but are not limited to, genomic, chromosomal and gene-specific sequences as well as full or partial protein sequences.

Sequence databases can be focused on subsets of all DNA, RNA or protein sequences ever documented. They might focus on particular organisms, species or animal families, as well as on developmental stages of organisms or on particular tissues.

Primary Archives

A small number of the sequence databases are utilized as primary archives for sequence data. The sequences recorded in these databases have been submitted by basic researchers from academic, industrial and sequencing labs.

The primary nucleotide sequence repositories are GenBank, EMBL and DDBJ. The nucleotide repositories are developed in collaboration and therefore contain the same records. The sites, however, differ in their user interfaces.

The primary protein sequence repositories are Swiss-Prot and PIR. Swiss-Prot has now been merged with TrEMBL in the database known as UniProt, which also contains data from PIR.

Secondary Archives

Secondary databases are those which derive additional information from the data in the primary databases. This includes the TrEMBL database, which contains all proteins predicted from genomic sequences within the primary nucleotide database EMBL. To predict these proteins, the coding regions of the nucleotide sequence must be known or predicted. This depends on accurately identifying the reading frame for the nucleotide sequence or on generating all possible translated frames.
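Generating all possible translated frames can be illustrated with a short Python sketch. It is only a minimal illustration, not the procedure used by TrEMBL or any other database: it assumes the standard genetic code, plain DNA input containing only A, C, G and T, and the function names (reverse_complement, translate, six_frame_translation) are hypothetical names chosen for this example.

```python
# Minimal sketch of a six-frame translation, assuming the standard genetic code.
# All function names here are illustrative, not part of any database's tooling.

BASES = "TCAG"
# Amino acids for all 64 codons in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA sequence codon by codon, ignoring a trailing partial codon.

    Codons containing characters other than A, C, G, T are reported as 'X'.
    """
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate three frames on the forward strand and three on the reverse."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```

Only one of the six frames normally corresponds to a real coding region, which is why identifying the correct reading frame matters; in practice a frame is often judged by the length of its open stretches between stop codons.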

Protein family and domain databases also fall into the category of secondary archives. The sequences stored within the primary databases are used to generate multiple sequence alignments from which family and domain classifications are made.

Additional databases that do not focus on nucleotide or amino acid sequences make use of, or link to, the primary archives. These include protein interaction, network and pathway databases, which might implicitly be considered secondary archives. However, they are typically treated as categories of their own.


October 21, 2004