Database development
for the P. falciparum genome project

Report of a workshop held on 23 November 1998
at the Wellcome Trust, London

For subsequent meeting reports please refer to MFI's Malaria Genome Database page at www.malaria.org/genome.html

Background and Introduction

This was the second of several workshops to discuss how to maximize use of the genomic information arising from the P. falciparum sequencing project. The aim of this meeting, which was jointly chaired by Ross Coppel (Monash) and Chris Newbold (Oxford), was to make recommendations on database development for the project. The 19 participants included representatives from the three sequencing centres, various malaria research labs involved in the project, the EBI, the NCBI, scientists involved in other genome projects and funders.

Cathy Fletcher (Wellcome Trust) introduced the meeting with a brief update on the sequencing project and congratulated the TIGR/NMRI team on the recent publication of the complete sequence of chromosome 2 in Science.

Dan Carucci (NMRI) gave a brief summary of the workshop on functional analysis of the malaria genome, held at TIGR on 9-10 November 1998. At this meeting the bioinformatics issues associated with the generation of large amounts of data from expression analysis had been highlighted.

The following topics were discussed:

1. Release of sequence data - format and accessibility
A position paper written by Ross Coppel and Chris Newbold described current problems associated with accessibility of the genome data, nomenclature and curatorship. Ross had sent out an email questionnaire to malaria scientists worldwide to enable him to assess how easy it is for the research community to access and use these data. Approximately one quarter of questionnaires sent to scientists in malaria-endemic countries failed to get through, demonstrating the difficulties in communication faced by these scientists. In fact, no responses were obtained from outside Europe/USA/Australia, for unknown reasons.

The replies obtained gave a snapshot of the current situation. Although a minority of scientists experienced no problems accessing the data, over 70 percent reported difficulties with access to the data, with analysis or with the available software. The survey identified a wide range of user needs at various levels, and many of the problems are not specific to the malaria genome project; for example, the need for sequence information to be in the right format to be useful, and a lack of bioinformatics expertise among biological scientists. Ross Coppel announced that he planned to include the malaria genome data in the next release of the WHO Sequence Database mailout. He requested some form of agreement from the three sequencing centres authorising him to copy and distribute the data in this way. Funding for this will initially be provided by WHO and Monash University in 1999.

It was suggested that to increase access by scientists in endemic countries there should be regional mirrors established in addition to data distribution by CD-ROM.

Chuong Huynh (NCBI) described three possible ways data could be released:
  1. immediate release of unfinished sequence via sequencing centre FTP site (current)
  2. immediate release at a separate central location (no accession number)
  3. submission of unfinished sequence via HTGS approach (receive accession number).

Chuong said that currently NCBI provide blastable databases of unfinished P. falciparum sequence data. They pull unfinished data from the sequencing centres’ FTP sites, convert it into a BLAST database file, and provide the interface for people to submit a sequence and BLAST against these databases. Recently Eugene Koonin has provided preliminary annotation for chromosome 3 data on NCBI’s Malaria Genetics and Genomics website.

NCBI are currently revamping their navigational system. Mention was made of the need to ensure that use of sequencing data observed the convention of avoiding chromosome-wide analysis until the responsible sequencing group had a chance to report the data.

In subsequent discussion some participants thought that release of HTGS data was desirable. The group agreed that access to blastable data needs to be improved and that acceptable procedures for annotation of data available via the web need to be clearly defined (see data release policy below).

2. Formats for finished and annotated data / development of a common nomenclature
Herve Tettelin (TIGR) explained the gene nomenclature used for chromosome 2, which is now in the published literature. The group agreed to adopt this system and give these recommended names alongside existing gene names.

Dan Lawson (Sanger Centre) described the development of the Malpep database of malarial proteins, which currently consists of 542 predicted ORFs from chromosomes 1, 2, and 3. Additional data from chromosomes 4 and 13 will be added soon. Search tools for gene finding are being developed.

Eula Fung (Stanford) reported that Stanford intend to include the research community in annotation of chromosome 12 by using a submission form. A discussion of various tools (eg Java, Diana, Ace) for annotation by researchers ensued.

Michael Ferdig (NIH) described the development of the P. falciparum genetic linkage map and database of sequence tagged sites. So far his lab at NIH have placed 34 markers from shotgun sequence from chromosomes 2-12, mostly in chromosomes 6, 7, 8, and 9.

3. Data Release Policy
In the next 12 months the majority of the genome sequence will be available in unfinished form, and Dan Lawson considered that this will cause problems because many groups will want to publish results derived from the unfinished data. The question of whether publication by the sequencing centres is being compromised needs to be addressed. The policies for release of contigs were compared: Sanger releases them as they are assembled, and Stanford releases contigs by bins of about 100-200 kb. TIGR had not released any contigs for chromosome 14 at the time of the workshop, but contigs and shotgun reads have since been released on the web. Several members considered that unauthorized annotation of the early released data with publication (either on a website or in press) prior to publication by the sequencing centres would be unethical.

In subsequent discussion it was agreed that the current data release policy statement for the malaria genome project should not be altered, and that it should be displayed more prominently on the websites of the sequencing centres.

It was also agreed that there should be more interaction between the sequencing centres and research community concerning the use of unfinished sequence data and that the sequencing centres should communicate frequently to discuss common issues concerned with data release.

4. The need for a malaria database
Michael Ashburner (EBI) shared some of his experience in the development of FlyBase, the Drosophila database. Particularly important is the FlyBase facility to update annotation, and he was concerned that there is nothing equivalent for malaria. Michael suggested that a standardized interface should be developed for all sites containing the malaria genome data, that the data should be reported in a computer-parsable way with agreed syntax, and that there should be an agreed mechanism for updating data. These points were agreed by the group.
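
As a rough illustration of what "computer-parsable with agreed syntax" could mean in practice, the following Python sketch parses one tab-delimited annotation record and rejects lines that break the agreed layout. The field names, the field order, the record class and the sample gene identifier are all hypothetical; the workshop did not specify any format.

```python
from dataclasses import dataclass

# Hypothetical agreed field order; the workshop did not define one.
FIELDS = ("gene_id", "chromosome", "start", "end", "strand", "description")


@dataclass(frozen=True)
class AnnotationRecord:
    gene_id: str
    chromosome: int
    start: int
    end: int
    strand: str
    description: str


def parse_record(line: str) -> AnnotationRecord:
    """Parse one tab-delimited line; raise ValueError if it breaks the syntax."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    gene_id, chromosome, start, end, strand, description = parts
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    # int() raises ValueError on non-numeric coordinates.
    record = AnnotationRecord(
        gene_id, int(chromosome), int(start), int(end), strand, description
    )
    if record.start > record.end:
        raise ValueError("start must not exceed end")
    return record


# Made-up example line, for illustration only.
example = parse_record("GENE0001\t2\t1500\t2900\t+\thypothetical protein")
print(example.gene_id, example.chromosome, example.end - example.start + 1)
```

A strict parser of this kind is one way to let every site holding the data check a record before accepting an update; any real format would need to be agreed by the sequencing centres and curators.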

Mark Blaxter (Edinburgh), who acts as curator for the WHO filarial genome project, described how data from the WHO’s parasite genome projects are handled. Martin Aslett at EBI coordinates the projects and curates the data, which are all based on ACeDB. Most of the data consist of expressed sequence tag (EST) information but there are also immunological, epidemiological and bibliographic data. A CD-ROM version will be released shortly. Several mirrors exist in developing countries (eg at Rio de Janeiro). Mark distinguished between sequencing curators and database curators and stressed that curation is an ongoing task that needs to be in the hands of the research community. He did not think that it is important which final database search engine is chosen, provided that cross-readability exists between centres. He suggested that a small number (maximum 3) of curatorship sites should exist; there would have to be agreement on data ownership. These points were agreed.

In subsequent discussion the group agreed that a central organism-specific database was urgently needed, with a common format for entering data, and the ability to download large files. The desirable attributes of a central database were identified as being in the community, incorporating multiple types of data, having a user-friendly interface and containing completed and documented data analysis. Ideally it should contain the following information:

  • Genomic information - sequences, contigs, assembled chromosomes with annotation
  • Genetics (crosses)
  • ESTs, GSTs, and STSs
  • Maps - physical map, genetic map, YAC and STS maps, optical map
  • Alignments
  • Polymorphisms - within and between genes and at chromosome ends
  • Information on field isolates versus lab strains.
  • Expression data from e.g. microarrays, SAGE
  • Immunochemistry data
  • Epitope mapping data
  • Antigens - information on localization etc.
  • Vaccinology data
  • Metabolic pathway data
  • Field data, entomology data
  • Phylogenetics and synteny data
  • Lists of labs and researchers
  • Bibliographic information

Both set-up funding and ongoing funding for curatorship would be needed. Opinions on the cost of curation of a malaria database varied, but possibly one postdoctoral assistant with computing skills, funded for 1 year, would be needed to set the database up. The group agreed that long-term curatorship would require 2-3 biologists, located in the research community, and one computer expert, possibly located at the EBI or NCBI. A commitment of at least 5 years' funding is needed, with realistic estimates for development and long-term maintenance costs.

The options for implementing the database were to:

Participants suggested that an advisory committee on database development, drawn from the malaria research community, should be formed.

5. Postgenomic funding needs

Training -
Participants agreed that more bioinformatics training is needed to enable the community to use the genome data, and recognized the current shortage of good bioinformaticists in academia. NCBI and ICGEB in Trieste run basic courses in bioinformatics and an advanced course at Cold Spring Harbor is planned. There will be a training course at Hinxton, 27 March-1 April 1999, for 2 people from each of the WHO genome projects, and 2 malaria people could be included. In addition, it was agreed that for developing countries web and CD-ROM-based tutorials and regional training are required. Martin Aslett has produced a tutorial on bioinformatics and Mark Blaxter agreed to arrange for it to be put onto CD-ROM for distribution to workers in developing countries.

Database funding - The mechanism for funding any centralized database for the malaria genome project will need to be decided by the funders. In view of the collaborative nature of the project, multiple-agency funding was thought preferable to single-agency funding. However, the procedures for handling applications would need to be discussed as the funders use different mechanisms.

Ross Coppel and Chris Newbold agreed to draw up a draft proposal for a malaria genome database, which will include specifications and rough costings. It will be circulated to the group for input before consideration by the funders at the next meeting of the malaria genome project in Chantilly on 29 January 1999.

Summary of recommendations - no formal votes were taken on any of the recommendations; rather there appeared to be consensus on many of the issues discussed by the participants.

Establishment and curation of a malaria database

Release and accessibility of data.

To improve accessibility of sequence data, particularly in developing countries, there should be: