
Protein Families and Domains

Family and domain databases are used to address the questions "What domains are contained within this sequence?" and "What family does this protein belong to?" Although some family and domain databases were developed with the intent to annotate genomic sequences, basic researchers also use these tools to better characterize their proteins of interest.

Classification of proteins into families and the assignment of putative functions rely heavily on identifying protein homologues across species (orthologues) and within species (paralogues). Homology is inferred from sequence similarity. The protein sequences themselves, independent of family or domain classification, are stored in protein sequence databases.
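As a rough illustration of how sequence similarity can be quantified, the sketch below computes percent identity between two sequences that have already been aligned. The function name and the choice to ignore gapped columns are assumptions made for this example. Real homology searches rely on substitution matrices and statistical significance, not on raw identity alone.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither sequence has a gap.

    Both inputs must come from the same alignment, so they have equal
    length and use '-' as the gap character (an assumption of this sketch).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # compared columns where the residues are identical
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1

    # Avoid division by zero when every column contains a gap.
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Toy aligned fragments, invented for this example.
    print(percent_identity("ACDE-G", "ACDFHG"))
```

Other conventions for the denominator exist (for example, the full alignment length or the length of the shorter sequence), so identity values are only comparable when the same convention is used.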

To answer the questions "What family does the sequence belong to?" or "What domains does it contain?", we must first define what we mean by families and domains.

Protein Families:

Protein families are made up of proteins related to one another by sequence similarity, domain composition or structure. Multiple sequence alignment has been used to discover and define the traits shared by proteins that have been determined experimentally to perform similar functions. Classic protein families include the globins and receptor kinases.

Protein Domains:

Stretches of amino acids that are the sites of biological functions, such as post-translational modification, protein-protein binding or protein-DNA binding, are often conserved and reused in multiple proteins. The identification of domains within a sequence suggests possible functions and family relationships. Domains are classically defined as independent folding units, a definition that refers to their structural features. Domains also contain highly conserved patterns of amino acid usage, called motifs, that can be used to identify the domain within larger protein sequences.
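The idea of locating a motif inside a larger sequence can be sketched by translating a PROSITE-style pattern into a regular expression. The function names below are hypothetical, the converter handles only the common pattern elements (residues, x, [..], {..}, repeat counts and the < and > anchors), and the pattern string is a C2H2 zinc-finger-style example chosen for illustration. Consult PROSITE itself for authoritative entries.

```python
import re

# One PROSITE-style element: optional '<', a core, optional (n) or (n,m), optional '>'.
_ELEMENT = re.compile(
    r"(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        n_term, core, low, high, c_term = match.groups()

        if core == "x":                 # any residue
            core_re = "."
        elif core.startswith("{"):      # any residue except those listed
            core_re = "[^" + core[1:-1] + "]"
        else:                           # single residue or [allowed set]
            core_re = core

        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"

        parts.append(("^" if n_term else "") + core_re + repeat + ("$" if c_term else ""))
    return "".join(parts)


def find_motifs(sequence: str, pattern: str) -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for every hit, overlaps included."""
    # The lookahead lets the scan report matches that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    zinc_finger_like = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
    # Toy sequence invented for this example, not a real protein.
    toy = "MKK" + "CAACAAAL" + "A" * 8 + "HAAAH" + "KK"
    print(prosite_to_regex(zinc_finger_like))
    print(find_motifs(toy, zinc_finger_like))
```

A pattern match of this kind only suggests that a domain may be present. Short patterns can occur by chance, which is one reason several of the databases listed below use profiles or HMMs instead.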

Classifications

Implicit in the process of defining protein families and domains is the classification process. In order to develop family and domain databases, categories of relatedness are developed. Proteins and their subsequences are then assigned to the appropriate categories. The classification schemes are somewhat arbitrary. Each category reflects the database authors' interpretation or formalization of family, domain and motif.

Databases:

Popular databases for protein families and domains are the following:

BLOCKS: Database of short, ungapped motifs.

SMART: Signaling and extracellular domains. Multi-domain proteins.

PRINTS: Database of fingerprints, which contain multi-motif patterns describing families.

PFAM (St. Louis) (Cambridge): Protein family database described with HMMs.

PROSITE: Manually aligned sequences. Domains represented by consensus sequences and profiles.
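To give a feel for how a set of short, ungapped aligned segments can be turned into a profile and used to scan a sequence, here is a minimal sketch. It is not the scoring method of BLOCKS, PROSITE or any other database listed above: the uniform background frequencies, the pseudocount of 1 and the function names are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block: list[str], pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Build per-column log-odds scores from an ungapped block of equal-length segments."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("all segments in the block must have the same length")

    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background frequency
    total = len(block) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in range(width):
        counts = Counter(segment[column] for segment in block)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window(sequence: str, profile: list[dict[str, float]]):
    """Return (1-based start, score, window) of the best-scoring window, or None."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues outside the 20 standard amino acids contribute a score of 0.
        score = sum(profile[i].get(aa, 0.0) for i, aa in enumerate(window))
        if best is None or score > best[1]:
            best = (start + 1, score, window)
    return best


if __name__ == "__main__":
    # Toy block and sequence, invented for this example.
    toy_block = ["GKSGK", "GKTGK", "GKSGR"]
    print(best_window("MAAGKSGKTT", build_profile(toy_block)))
```

Unlike a consensus sequence, a profile of this kind keeps the observed variation at each column, so a window can score well even when it differs from every individual segment in the block.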

Integrated database sites:

InterPro, Ensembl

October 24, 2004