1.0 Introduction
2.0 Goals
3.0 IMD Prototypes Demonstration
4.0 Recommended Activities
4.2 System Design and Implementation
4.3 Data to be Integrated
4.3.2 Existing Databases
4.3.3 Databases Needing Development
4.3.4 Other Groups Which Have Microbial Strain Data
5.0 Federation Membership and Responsibilities
6.0 Workshop Participants
7.0 Summary
The potential user community for an integrated microbial database system would include: (i) microbial ecologists exploring the patterns and extent of global microbial biodiversity; (ii) biochemists who could use information about an organism's phylogeny and physiology to select appropriate organisms for comparative studies; (iii) medical microbiologists looking for rapid access to comprehensive information about pathogenic microorganisms; (iv) taxonomists working on microbial classification; (v) educators and students of microbiology needing an updated resource of microbiological information; and (vi) industrial microbiologists seeking more efficient means to recognize new diversity.
The first system utilized a relational database approach based upon a modified version of Sybase. Databases included in the prototype were a subtree of the Ribosomal Database Project (RDP) phylogenetic tree, a subset of fatty acid (FAME) data from Microbial ID, Inc., Newark, DE (MIDI), and RKC-encoded phenotypic data and a microbial taxonomy, both provided by Bergey's Manual Trust. To allow queries to be asked within a phylogenetic framework, the phylogenetic tree served as the interface for access to the other data. Queries about a particular organism or subtree were asked using a "Query Tree" menu, with the results displayed on the tree. Queries incorporated into this prototype were: 1.) show on the tree the phylogenetic distribution of a specified trait; 2.) list all known traits of a specified organism; and 3.) list all common traits of a specified subtree. The advantage of the relational database approach is that it is a mature technology. The disadvantages include the need for a Sybase license, the lack of inherent web support, and the need to recreate 100-200 schemas for existing databases.
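The three queries listed above can be illustrated with a minimal Python sketch. It is not the Sybase prototype: the tree, the organism names (org_A, clade_1, etc.), and the trait labels are hypothetical stand-ins chosen only to show what each query returns.

```python
# Hypothetical toy data: a child -> parent map for a small subtree and
# a set of traits per organism (leaf). None of this comes from the RDP.
PARENT = {
    "org_A": "clade_1",
    "org_B": "clade_1",
    "org_C": "clade_2",
    "clade_1": "root",
    "clade_2": "root",
}
TRAITS = {
    "org_A": {"nitrogen_fixation", "gram_negative"},
    "org_B": {"gram_negative", "motile"},
    "org_C": {"nitrogen_fixation", "spore_forming"},
}


def leaves_under(node):
    """Return the organisms (leaves) in the subtree rooted at node."""
    if node in TRAITS:
        return {node}
    found = set()
    for child, parent in PARENT.items():
        if parent == node:
            found |= leaves_under(child)
    return found


def distribution_of(trait, root="root"):
    """Query 1: the organisms in the tree that carry the given trait."""
    return {org for org in leaves_under(root) if trait in TRAITS[org]}


def traits_of(organism):
    """Query 2: all known traits of one organism."""
    return set(TRAITS[organism])


def common_traits(node):
    """Query 3: the traits shared by every organism in a subtree."""
    members = leaves_under(node)
    if not members:
        return set()
    return set.intersection(*(TRAITS[org] for org in members))
```

In a real system the trait sets would be drawn from the contributing databases rather than held in memory, and the result of the first query would be drawn onto the tree display rather than returned as a set.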
The second system was a proposal to use SRS 5, an improved version of the EMBL-supported Sequence Retrieval System (SRS) by Thure Etzold. This World Wide Web based database network already contains more than 100 molecular biology databases and can, without modification, accommodate most data relevant to microbiology. The advantages of this approach include 1) a large set (over 120) of molecular biology databases already connected, 2) a fast query engine that can follow links, 3) a flexible DDL (data definition language), 4) free availability with source code, and 5) a responsive development team. Disadvantages include a weak user interface (no phylogenetic interface) and no support for non-textual data. (The development team has recently shown a willingness to support taxonomies and integrate phylogenetic/taxonomic interfaces currently being developed by Oliver Strunk and Niels Larsen.)
The third system was a WWW-based prototype that organized the same data used in the relational database prototype within a phylogenetic framework using a subtree of the RDP phylogenetic tree. Written in Perl 5, it implemented a mechanism for navigating through the tree, contained a method for calculating rRNA signatures, and could link the data to the outside world. At the time of the workshop a general query mechanism was not yet complete.
2.) Database Experts (data structure, syntax, semantics, data contributors).
3.) User Representatives (from industrial, academic, and clinical arenas).
4.) Representatives from funding agencies, developing nations and scientific societies.
B. Coordination Center: It was recommended that the Coordination Center for the Integrated Microbial Database project be located at the Center for Microbial Ecology at Michigan State University. The functions of the center will be:
b.) Providing a legal entity for receipt and distribution of funds associated with the project.
c.) Preparing reports to funding agencies.
d.) Convening meetings and workshops.
e.) Public Relations (press releases, publicity, development of a homepage on the World Wide Web, etc.).
It was recommended, however, that IMD members adhere to the following minimal requirements when submitting data: 1) be able to upload 100% consistently formatted ASCII versions of their data (except, of course, for graphics), and 2) provide a clear description of what the data are and how they relate, either in a concise English form or using a formal notation. It was recognized that an effective meta-data system is an important component of an integrated database. The suitability of current meta-data curation systems needs to be evaluated.
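As an illustration of the first requirement, a submission could be screened automatically before it is accepted. The sketch below is only one possible reading of "consistently formatted": it assumes a delimited text file (tab-separated by default, which is an assumption and not something the workshop specified) and checks that the content is pure ASCII and that every non-blank line has the same number of fields. The function name check_submission is hypothetical.

```python
def check_submission(raw: bytes, delimiter: str = "\t"):
    """Return a list of problems found in a delimited ASCII submission."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not ASCII.
        return [f"non-ASCII byte at offset {err.start}"]
    problems = []
    expected = None
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        count = len(line.split(delimiter))
        if expected is None:
            expected = count  # the first data line sets the field count
        elif count != expected:
            problems.append(
                f"line {number}: {count} fields, expected {expected}"
            )
    return problems
```

An empty list means the file passed both checks. The second requirement, a description of what the fields mean, still has to be supplied by the contributor.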
Most data relate to organisms; therefore, a consistent organism description and nomenclature is required to connect the data. It was decided to initiate an international collaborative effort to curate a single, comprehensive, prokaryotic nomenclatural database, with synonyms and name histories (via the Internet), to be used to interconnect data from participating databases (further discussion below).
An integrated microbial database should be structured around an up-to-date phylogeny and/or taxonomy. This will provide a natural framework for selecting input and for viewing results. Examples of queries brought up during the workshop include "What are the evolutionary relationships among organisms that are capable of nitrogen fixation, by what pathways is nitrogen fixed, and are there any FAME signatures common for these groups of organisms?". Retrieval of the answers would require information from several contributing databases, which would be displayed in a phylogenetic format. Results would also contain links to further information. For example, linking to the culture collection databases would permit a user to inquire further about the availability of a particular culture of nitrogen-fixing bacteria as well as providing additional information about nitrogen-fixing bacteria that are held in culture collections.
For data that do not lend themselves easily to phylogenetic display (e.g. gene locations, metabolic pathways), existing methods for viewing the data must be easy to connect. Generally, an open model should be sought where any type of microbial data or software could be included.
The participants finally agreed that, ideally, anyone with World Wide Web access should be able to navigate through and query all data easily and effectively.
The nomenclature should include validly published names of all prokaryotic species, annotated with reclassifications and synonyms. It must preserve names as they appear in the original data, as well as their association with a "canonical name." It is essential that names preserve the full resolution of the original source (strain level, where available); they can always be mapped to other levels (e.g., species level).
A registry of cross-referenced strain designations (including researcher strain designations and culture collection identifiers) will be developed and maintained. This will provide the hooks for links to specific culture collection data records and sequence databank entries. The primary source of this information will be existing culture collection databases. Also, the Microbial Information Network of Europe (MINE) may serve as a framework for this information.
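A registry of this kind amounts to grouping designations that refer to the same strain and attaching database links to each group. The following Python sketch shows one way this could work; the class name StrainRegistry, its methods, and all designations and record identifiers used with it are hypothetical, and no particular culture collection's data model is implied.

```python
import itertools


class StrainRegistry:
    """Groups equivalent strain designations and stores links per group."""

    def __init__(self):
        self._ids = itertools.count()
        self._group_of = {}  # designation -> group id
        self._members = {}   # group id -> set of designations
        self._links = {}     # group id -> set of (database, record id)

    def register(self, *designations):
        """Declare that all given designations refer to the same strain."""
        known = sorted(
            {self._group_of[d] for d in designations if d in self._group_of}
        )
        group = known[0] if known else next(self._ids)
        self._members.setdefault(group, set())
        self._links.setdefault(group, set())
        # Merge any other groups that the new statement connects.
        for other in known[1:]:
            self._members[group] |= self._members.pop(other)
            self._links[group] |= self._links.pop(other)
        self._members[group].update(designations)
        for designation in self._members[group]:
            self._group_of[designation] = group
        return group

    def add_link(self, designation, database, record_id):
        """Attach a pointer to a record in an external database."""
        self._links[self._group_of[designation]].add((database, record_id))

    def equivalents(self, designation):
        """All designations known to refer to the same strain."""
        return set(self._members[self._group_of[designation]])

    def links(self, designation):
        """All external records reachable from any equivalent designation."""
        return set(self._links[self._group_of[designation]])
```

With this structure, a link recorded under a culture collection identifier is also found when the strain is looked up by a researcher's own designation, which is the kind of hook the registry is meant to provide.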
The nomenclatural database should also include taxon name, taxon rank, parent taxon, author/authority of the taxon name, date of validation, and the source database for the taxon. Although higher level taxa are more problematic, the database must also support the organization of the names into a hierarchical classification. These classifications will be unstable and subject to change. Early in the project, it is likely that it will be necessary to resort to classifications with no official standing. As formal, phylogenetically based, higher level taxa are published, they will replace the provisional names. This effort will be coordinated with the RDP's nomenclatural needs and its migration to a DBMS.
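The fields named above, together with the requirements that original names be preserved, mapped to a canonical name, and mappable to other ranks, suggest a record structure along the following lines. This Python sketch is an assumption about how such a database might be laid out, not a schema agreed at the workshop; the class names Taxon and Nomenclature and the taxon names in the example are invented placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Taxon:
    name: str
    rank: str                      # e.g. "genus", "species", "strain"
    parent: Optional[str] = None   # name of the parent taxon
    authority: str = ""            # author/authority of the taxon name
    validated: str = ""            # date of validation
    source_db: str = ""            # database that supplied the record
    synonyms: List[str] = field(default_factory=list)


class Nomenclature:
    def __init__(self):
        self.taxa: Dict[str, Taxon] = {}
        self.canonical: Dict[str, str] = {}  # any known name -> canonical

    def add(self, taxon: Taxon) -> None:
        self.taxa[taxon.name] = taxon
        self.canonical[taxon.name] = taxon.name
        for old_name in taxon.synonyms:
            self.canonical[old_name] = taxon.name

    def resolve(self, name_as_given: str):
        """Return (original name, canonical name or None); the original
        spelling is kept alongside the canonical one."""
        return name_as_given, self.canonical.get(name_as_given)

    def at_rank(self, name: str, rank: str) -> Optional[str]:
        """Walk up the parent links until a taxon of the given rank."""
        current = self.canonical.get(name)
        while current in self.taxa:
            taxon = self.taxa[current]
            if taxon.rank == rank:
                return taxon.name
            current = taxon.parent
        return None


# Invented placeholder names, for illustration only.
names = Nomenclature()
names.add(Taxon("Examplea", "genus"))
names.add(Taxon("Examplea nova", "species", parent="Examplea",
                synonyms=["Oldgenus nova"]))
names.add(Taxon("Examplea nova strain X1", "strain",
                parent="Examplea nova"))
```

Because the hierarchy is held only as parent links, a provisional higher level classification can later be replaced by repointing those links without touching the species and strain records.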
b.) Phylogenetic Trees
The microbial phylogeny, based upon 16S rRNA sequence, will be at the core of the system and serve as the framework for accessing the various databases.
To establish a phylogenetic organization of data it is first necessary to identify which type strains have already been sequenced. A quick look through the RDP indicates that only about thirty percent of them are included. It is absolutely necessary to fill the gaps to provide the user of the IMD with a meaningful correlation between phylogeny and distribution of phenotypic characters. In order to avoid redundant labor, the sequencing effort needs to be coordinated by identifying interested sequencing groups. The Center for Microbial Ecology, along with the RDP and DSM, is currently coordinating such an effort.
Another requirement is to identify those sequences already in the database that need to be resequenced (fully or partially) because of poor quality. To identify those sequences, a program needs to be written that checks the sequences for "abnormal" idiosyncrasies in the primary sequence. Improved sequences will improve: (i) the phylogenetic analysis; (ii) probe design; and (iii) Amplified Ribosomal DNA Restriction Analysis (ARDRA) comparison.
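The workshop did not define which idiosyncrasies such a program should look for. As a hedged starting point only, the Python sketch below flags sequences on two simple criteria that are assumptions made here for illustration: a length shorter than expected, and a high fraction of ambiguous base calls. The function name and both default thresholds are hypothetical and would need to be set by the curators.

```python
def flag_for_resequencing(sequences, min_length=1200, max_ambiguous=0.01):
    """Return {sequence id: [reasons]} for sequences that look poor.

    sequences maps an identifier to a nucleotide string. Alignment gap
    characters ("-" and ".") are removed before checking.
    """
    flagged = {}
    for seq_id, seq in sequences.items():
        bases = seq.upper().replace("-", "").replace(".", "")
        reasons = []
        if len(bases) < min_length:
            reasons.append("short")
        # Anything other than A, C, G, T or U counts as an ambiguous call.
        ambiguous = sum(1 for base in bases if base not in "ACGTU")
        if bases and ambiguous / len(bases) > max_ambiguous:
            reasons.append("ambiguous")
        if reasons:
            flagged[seq_id] = reasons
    return flagged
```

A fuller check would probably also compare each sequence against conserved positions and secondary structure in the alignment, which this sketch does not attempt.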
The 16S rDNA database needs to be complemented by information on the sequence of 23S rDNA whenever the 16S rDNA fails to give sufficient resolution. The 23S data can also serve to improve probe design.
At the level of data handling the following steps need to be taken. (i) The capture of sequences from the primary source needs to be improved to avoid escape of deposited sequences. (ii) The release time of new versions of the RDP trees must also be shortened, by reducing the release intervals of alignments and trees and by improving the alignment procedure, e.g., by implementing automatic alignment by secondary structure.
With respect to environmental sequences, the following problems need to be resolved. Environmental 16S rRNA genes should be amplified using processes that allow amplification of the full sequence. It is furthermore recommended to sequence the homologous stretch of at least 300 nucleotides from the 5' terminus. Other regions could be sequenced at a later stage whenever necessary; the amplification product could be made available to other researchers.
c.) Phenotypic Data
Incorporation of phenotypic data is essential for the utility of an IMD. One method of capturing such data is by use of RKC code. The RKC code is a comprehensive, open-ended coding system of phenotypic characteristics of microorganisms. This system is in use in a number of institutions, e.g. ATCC, the Food and Drug Administration (FDA), the Environmental Protection Agency (EPA), the Microbial Strain Data Network (MSDN), etc. It is now being employed in the construction of the database to be used in updating and producing future editions of Bergey's Manuals. These manuals are considered the authoritative taxonomic treatments of the bacteria. Thus, the critically important information produced by the above organizations could potentially be more easily integrated into the IMD because of the use of a common coding system. One drawback to the use of RKC code is that access to much of these data will have to be negotiated, and in some cases licenses may have to be obtained. Because of this, alternate sources of phenotypic data should also be explored.
d.) Cellular Fatty Acids (FAME)
One type of phenotypic data considered a priority by workshop participants was whole cell fatty acid profiles (FAME), which are commonly used for identification and classification of microorganisms in clinical and environmental samples. FAME is considered a high priority because it is one of the most common entry points for scientists attempting to obtain further information on a new isolate. Commercially available computerized systems such as the Microbial Identification System (MIDI) provide rapid, reproducible and inexpensive fatty acid analyses using internal fatty acid libraries. The inclusion of the commercial FAME data into the IMD could provide highly standardized and regularly updated datasets. As this may also require licensing, the IMD will further seek to acquire and evaluate FAME data from research laboratories and literature for potential incorporation into the database.
Environmental data exist on-line. Such databases (e.g. the World Conservation Monitoring Center [WCMC]) will have to be explicitly identified and linked. The initial effort is to provide entry pointers to these information sources. These links, in their simplest initial form, will allow users to utilize the facilities within these resources. A second stage effort will provide more complete links utilizing data elements in common. Common elements such as taxonomic trees and phenotypic descriptors will be provided as part of the IMD services.
b.) Independently Curated, Specific Databases
There are several specific databases of special interest to microbiologists that are either on-line or could be so in the near future. These include the metabolic pathway database prepared by Ross Overbeek and Evgeni Selkov's team, the quinone database of Komagata, the gyrB sequence database of Harayama (which appears to be particularly useful at the species level), and perhaps others such as those held by the Belgian group.
These specialty databases would remain curated by those groups specifically interested in them and will most likely not be comprehensive. But they can be extremely useful for certain organism groups and may gain wider use and development if conveniently linked and accessible.
It is recommended that the IMD explore the state of these databases, their development and curation, plans for on-line access, and linkage issues with their developers. The federation would like to encourage linkages to such databases and should work in collaboration with the developers/curators to solve linkage problems.
b.) Habitat
Workshop participants identified the need for habitat information to be part of any IMD. The situation at present is that no specific microbial habitat database exists. Habitat data are presently collected and recorded erratically (and usually incompletely). Publications where one would expect to find rather complete habitat descriptions, for example, the International Journal of Systematic Bacteriology, often do not require more than cursory descriptions.
Habitat description is deemed essential to refine the collection location for future collections, draw inferences about the physiological/phenotypic characteristics of an organism, do comparative ecological studies between sites, and locate sites in which to look for similar organisms. It is also important to attempt to harmonize the types of data with those collected by macroecologists and systematists.
It was recommended that the data recorded by individual investigators that are considered absolutely essential include:
Within these, subtypes are possible: for example, terrestrial.
c.) Databases Obtained with Commercial Test Kits or Systems
A number of such systems exist and are widely used by clinical microbiologists, ecologists, culture collections, and those working with the isolation and characterization of bacteria from environmental samples. Examples of specific commercial products include Biolog and API strips. Data obtained using these commercial products can be found in three general categories of databases: those kept by the commercial firm that produces the test kit/system; those amassed by culture collections; and those obtained by individual users of the products. The databases kept by the commercial firm are likely to be large (amassed using a large group of organisms) and collected under a standard set of operating conditions. The availability of such databases for an IMD is uncertain at this time.
The databases amassed by culture collections (for example, ATCC) are likely to contain information on as many organisms as those of the commercial firms, or more, and also to be of high quality, having been run under a well-documented standard set of experimental conditions. The availability and form of these data are unknown, but they are presumed to be more available than those from commercial firms.
The databases amassed by individuals are likely to be small for any one researcher (user) but perhaps huge for the collective research enterprise. However, operating conditions are also likely to be highly variable from person to person, as users modify run conditions to suit their particular systems/organisms. The distributed nature of these data and their variable quality may make them impossible to collect or validate for use in the IMD.
The workshop participants agreed that various types of phenotypic data would be an extremely valuable part of any IMD. Given the relative paucity of phenotypic databases and the potential of these commercial metabolic test kits/systems to provide such data, the steering committee of the IMD or their designee should pursue the availability of data obtained from using such systems first from culture collections participating in the IMD and then from the commercial firms that supply the units/systems.
d.) Images
It is proposed that, ultimately, microbial images be included in the IMD. Among the reasons for this proposal is the fact that an image of the microbe itself (i.e. its morphology), either as an individual cell or a multicellular arrangement, is one of the first attributes of a microbe to be recorded and quantified. In some cases, morphology alone is so distinctive as to afford an identification to the genus or group level (e.g. Caulobacter, Gallionella, spirochetes).
Initial efforts should be to compile in the database light and electron micrographs of cells and their distinctive morphological features, for example: (i) appendages, sheaths, intracellular inclusions and distinctive membrane arrangements, etc.; (ii) resting/dormant stages (spores/cysts) and other morphogenetic forms (swarmer cells); and (iii) distinctive multicellular assemblages (colonies, swarms, fruiting bodies). In addition, since many microbes induce characteristic lesions or other morphological changes in association with plant and animal hosts, images of such processes should also be included (e.g. pustules, tubercles, galls, nodules). Images derived from more sophisticated, spectroscopic analyses (e.g. FTIR spectra of cell envelopes) should also be included, although it is recognized that some potential users may not have such data nor the means to acquire them readily.
A longer term effort should be directed to establishing within the database the capacity for image analysis, with the goal of identifying microbes by computerized comparisons to images held in the database.
It is recommended that the IMD steering committee initiate a plan to encourage the participation of these collections as active partners in the development of the IMD. It should be pointed out to them that their partnership role would be beneficial to the participants through access to analytical tools developed by the IMD and an enhancement of the participants' overall collection databases.
An added incentive for the national resource repositories' participation in the IMD would be the value of the resultant database toward regulatory compliance issues such as GMP, GLP, and ISO 9000.
The responsibilities of the Federation members are to: (i) work towards a common integrated database; (ii) share in the division of labor as negotiated among the members; (iii) collaborate with other members to achieve specific objectives; and (iv) participate in seeking funds for Federation objectives.