Report of the Workshop on Integration of Microbial Databases




CME is interested in any comments you may have about the contents of this report.
Please contact niels@vitro.cme.msu.edu with any questions or comments you may have.

Table of Contents

1.0 Introduction

2.0 Goals

3.0 IMD Prototypes Demonstration

4.0 Recommended Activities

4.1 Organization and Administration

4.2 System Design and Implementation

4.3 Data to be Integrated

4.3.1 First Priority Data

a.) Nomenclatural Database
As stated above, an up-to-date organismal nomenclature is central to the proposed IMD. It was decided to initiate, as the highest priority, an international collaborative effort to curate a single prokaryotic nomenclature via the Internet. Curators will initially include a representative from Bergey's Manual Trust, DSM (Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH), where the International Journal of Systematic Bacteriology is edited, RDP (Ribosomal Database Project), and JSCC (Japan Society of Culture Collections). CME will offer a World Wide Web annotation interface for the nomenclatural database.

The nomenclature should include the validly published names of all prokaryotic species, annotated with reclassifications and synonyms. It must preserve names as they appear in the original data, as well as their association with a "canonical name." It is essential that names preserve the full resolution of the original source (strain level, where available); they can always be mapped to other levels (e.g., species level).
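The requirement to keep the original name, its canonical name, and the strain-level resolution together can be sketched as a small record type. This is only an illustration, not a design adopted by the workshop: the class name NameRecord, its field names, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NameRecord:
    """Hypothetical record; field names are illustrative assumptions."""
    original_name: str             # name exactly as it appears in the source data
    canonical_name: str            # name the original is currently associated with
    strain: Optional[str] = None   # strain designation, kept when the source has one

    def at_strain_level(self) -> str:
        """Full resolution of the source: canonical name plus strain, if any."""
        if self.strain is None:
            return self.canonical_name
        return f"{self.canonical_name} {self.strain}"

    def at_species_level(self) -> str:
        """Coarser mapping: drop the strain designation."""
        return self.canonical_name


# Illustrative values only: a reclassified name with a strain designation.
rec = NameRecord("Pseudomonas cepacia", "Burkholderia cepacia", "ATCC 25416")
print(rec.original_name)       # the name as deposited is never overwritten
print(rec.at_strain_level())
print(rec.at_species_level())
```

Because the original name is stored rather than replaced, a later reclassification only changes the canonical name and leaves the source data intact.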

A registry of cross-referenced strain designations (including researcher strain designations and culture collection identifiers) will be developed and maintained. This will provide the hooks for links to specific culture collection data records and sequence databank entries. The primary source of this information will be existing culture collection databases. Also, the Microbial Information Network of Europe (MINE) may serve as a framework for this information.
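One way to picture such a registry is as groups of designations that all refer to the same strain, so that any one identifier leads to the others. The sketch below assumes that model; the class StrainRegistry, its methods, and the identifiers used are hypothetical and do not describe MINE or any existing culture collection database.

```python
from collections import defaultdict


class StrainRegistry:
    """Hypothetical cross-reference registry for strain designations."""

    def __init__(self):
        self._group_of = {}                # designation -> group id
        self._members = defaultdict(set)   # group id -> set of designations
        self._next_id = 0

    def link(self, *designations):
        """Record that all given designations refer to the same strain."""
        existing = {self._group_of[d] for d in designations if d in self._group_of}
        if existing:
            target = min(existing)
        else:
            target = self._next_id
            self._next_id += 1
        # Merge any other groups that turn out to be the same strain.
        for gid in existing - {target}:
            for d in self._members.pop(gid):
                self._group_of[d] = target
                self._members[target].add(d)
        for d in designations:
            self._group_of[d] = target
            self._members[target].add(d)

    def equivalents(self, designation):
        """Return every designation known for the same strain (empty if unknown)."""
        gid = self._group_of.get(designation)
        return set() if gid is None else set(self._members[gid])


# Made-up identifiers: two collection numbers and a researcher's lab label.
registry = StrainRegistry()
registry.link("CC-A 101", "CC-B 202")
registry.link("CC-B 202", "lab X7")
print(sorted(registry.equivalents("lab X7")))
```

In a real system each designation would also carry the link to its culture collection record or sequence databank entry; that part is omitted here.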

The nomenclatural database should also include taxon name, taxon rank, parent taxon, author/authority of the taxon name, date of validation, and the source database for the taxon. Although higher level taxa are more problematic, the database must also support the organization of the names into a hierarchical classification. These classifications will be unstable and subject to change. Early in the project, it is likely that it will be necessary to resort to classifications with no official standing. As formal, phylogenetically based, higher level taxa are published, they will replace the provisional names. This effort will be coordinated with the RDP's nomenclatural needs, and its migration to a DBMS.
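The fields listed above, together with the parent link that yields the hierarchy, might be represented as follows. This is a sketch under stated assumptions: the Taxon class, the lineage helper, and every taxon name and value in the sample data are placeholders invented for the example, not entries from any real nomenclature.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Taxon:
    name: str
    rank: str
    parent: Optional[str]      # name of the parent taxon; None at the root
    authority: str             # author/authority of the taxon name
    validated: Optional[str]   # date of validation; None for provisional taxa
    source_db: str             # database the record came from


def lineage(taxa: Dict[str, Taxon], name: str) -> List[str]:
    """Walk parent links from a taxon up to the root; return root-first order."""
    path, seen = [], set()
    current: Optional[str] = name
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle in classification at {current!r}")
        seen.add(current)
        path.append(current)
        current = taxa[current].parent
    return list(reversed(path))


# Placeholder records: a provisional higher taxon above a genus and a species.
taxa = {
    "provisional group 1": Taxon("provisional group 1", "provisional", None,
                                 "none", None, "example-db"),
    "Examplebacter": Taxon("Examplebacter", "genus", "provisional group 1",
                           "placeholder authority", "placeholder date", "example-db"),
    "Examplebacter primus": Taxon("Examplebacter primus", "species", "Examplebacter",
                                  "placeholder authority", "placeholder date", "example-db"),
}
print(lineage(taxa, "Examplebacter primus"))
```

Storing only the parent link means that replacing a provisional higher taxon with a formally published one is a change to a single field, and every lineage below it follows automatically.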

b.) Phylogenetic Trees
The microbial phylogeny, based upon 16S rRNA sequence, will be at the core of the system and serve as the framework for accessing the various databases.

To establish a phylogenetic organization of data, it is first necessary to identify which type strains have already been sequenced. A quick look through the RDP indicates that only about thirty percent of them are included. It is absolutely necessary to fill the gaps to provide the user of the IMD with a meaningful correlation between phylogeny and distribution of phenotypic characters. To avoid redundant labor, the sequencing effort needs to be coordinated by identifying interested sequencing groups. The Center for Microbial Ecology, along with the RDP and DSM, is currently coordinating such an effort.

Another requirement is to identify those sequences, already in the database, that need to be resequenced (in full or in part) because of poor quality. To identify those sequences, a program needs to be written that checks the sequences for "abnormal" idiosyncrasies in the primary sequence. Improved sequences will improve: (i) the phylogenetic analysis; (ii) probe design; and (iii) Amplified Ribosomal DNA Restriction Analysis (ARDRA) comparison.
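The report does not specify which idiosyncrasies such a program should look for. As a minimal sketch, assuming that short length, characters outside the IUPAC nucleotide codes, a high proportion of ambiguous bases, and long single-base runs are reasonable warning signs, a first-pass screen could look like this. The function name quality_flags and all thresholds are assumptions chosen for illustration, not values recommended by the workshop.

```python
import re

UNAMBIGUOUS = set("ACGTU")
IUPAC = set("ACGTURYSWKMBDHVN")   # standard IUPAC nucleotide codes


def quality_flags(seq, min_length=1200, max_ambiguous_fraction=0.01, max_run=8):
    """Return a list of reasons a sequence may need resequencing.

    All thresholds are illustrative assumptions.
    """
    seq = seq.upper().replace("-", "").replace(".", "")   # drop alignment gaps
    flags = []
    if len(seq) < min_length:
        flags.append("short")
    if set(seq) - IUPAC:
        flags.append("invalid_characters")
    if seq:
        ambiguous = sum(1 for base in seq if base not in UNAMBIGUOUS)
        if ambiguous / len(seq) > max_ambiguous_fraction:
            flags.append("ambiguous_bases")
    # One character followed by max_run or more repeats: a run longer than max_run.
    if re.search(r"(.)\1{%d,}" % max_run, seq):
        flags.append("long_homopolymer")
    return flags


# Synthetic test strings, not real rRNA sequences.
print(quality_flags("ACGU" * 400))              # nothing flagged
print(quality_flags("ACGU" * 10))               # too short
print(quality_flags("ACGU" * 400 + "N" * 50))   # ambiguous bases and a long run
```

A screen of this kind only nominates candidates; comparison against the alignment and conserved secondary structure would still be needed to decide which entries to resequence.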

The 16S rDNA database needs to be complemented by information on the sequence of 23S rDNA whenever the 16S rDNA fails to give sufficient resolution. The 23S data can also serve to improve probe design.

At the level of data handling, the following steps need to be taken. (i) The capture of sequences from the primary source needs to be improved so that deposited sequences are not missed. (ii) The release time of new versions of the RDP trees must also be shortened, by reducing the release intervals of alignments and trees and by improving the alignment procedure, e.g., by implementing automatic alignment by secondary structure.

With respect to environmental sequences, the following problems need to be resolved. Environmental 16S rRNA genes should be amplified using procedures that allow amplification of the full sequence. It is furthermore recommended to sequence the homologous stretch of at least 300 nucleotides from the 5' terminus. Other regions could be sequenced at a later stage whenever necessary, and the amplified product could be made available to other researchers.

c.) Phenotypic Data
Incorporation of phenotypic data is essential for the utility of an IMD. One method of capturing such data is by use of RKC code. The RKC code is a comprehensive, open-ended coding system of phenotypic characteristics of microorganisms. This system is in use in a number of institutions, including ATCC, the Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), and the Microbial Strain Data Network (MSDN). It is now being employed in the construction of the database to be used in updating and producing future editions of Bergey's Manuals. These manuals are considered the authoritative taxonomic treatments of the bacteria. Thus, the critically important information produced by the above organizations could potentially be more easily integrated into the IMD because of the use of a common coding system. One drawback to the use of RKC code is that access to much of these data will have to be negotiated and in some cases licenses may have to be obtained. Because of this, alternate sources of phenotypic data should also be explored.

d.) Cellular Fatty Acids (FAME)
One type of phenotypic data considered a priority by workshop participants was whole cell fatty acid profiles (FAME), which are commonly used for identification and classification of microorganisms in clinical and environmental samples. FAME is considered a high priority because it is one of the most common entry points for scientists attempting to obtain further information on a new isolate. Commercially available computerized systems such as the Microbial Identification System (MIDI) provide rapid, reproducible, and inexpensive fatty acid analyses using internal fatty acid libraries. The inclusion of the commercial FAME data into the IMD could provide highly standardized and regularly updated datasets. As this may also require licensing, the IMD will further seek to acquire and evaluate FAME data from research laboratories and the literature for potential incorporation into the database.

4.3.2 Existing Databases

a. Publicly Accessible On-Line Databases
b. Independently Curated, Specific Databases

4.3.3 Databases Needing Development

a. ARDRA
b. Habitat
c. Databases Obtained with Commercial Test Kits or Systems
d. Images

4.3.4 Other Groups Which Have Microbial Strain Data

5.0 Federation Membership and Responsibilities

6.0 Workshop Participants

7.0 Summary


More information about the integrated database project is available in insights, the CME Newsletter.

