In this paper we describe our efforts and experience in constructing GoTag, a distributed system for automatically annotating Medline documents with relevant Gene Ontology (GO) terms. The system is built on top of a service-based text mining infrastructure that integrates tools developed within the Discovery Net and myGrid projects. Two baseline approaches to assigning GO terms have been developed. The first assigns GO terms by directly matching GO term names and synonyms in documents. The second uses a document classifier trained on feature-vector representations of documents, where the GO terms associated with each training document are obtained from the manually curated yeast genome database. We present preliminary results from evaluating the two approaches and discuss proposals for enhancing both baselines, as well as for constructing a hybrid approach.
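As a rough illustration of the first baseline (direct matching of GO term names and synonyms), the following minimal sketch may help. It is not the GoTag implementation: the lexicon `GO_LEXICON`, the helper names `build_patterns` and `annotate`, and the choice of case-insensitive whole-word matching are all assumptions made for this example.

```python
import re

# Toy lexicon assumed for illustration only: GO identifier -> name and synonyms.
# A real system would load these from the Gene Ontology release files.
GO_LEXICON = {
    "GO:0006915": ["apoptotic process", "apoptosis"],
    "GO:0005634": ["nucleus", "cell nucleus"],
}


def build_patterns(lexicon):
    """Compile one case-insensitive, whole-word pattern per GO term."""
    patterns = {}
    for go_id, names in lexicon.items():
        # Try longer names first so multi-word names win over their substrings.
        ordered = sorted(names, key=len, reverse=True)
        alternatives = "|".join(re.escape(name) for name in ordered)
        patterns[go_id] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return patterns


def annotate(text, patterns):
    """Return the sorted GO identifiers whose name or synonym occurs in text."""
    return sorted(go_id for go_id, pattern in patterns.items() if pattern.search(text))


patterns = build_patterns(GO_LEXICON)
matches = annotate("Apoptosis was observed in the cell nucleus.", patterns)
```

Exact matching of this kind misses morphological variants (for example "nuclear" does not match "nucleus"), which is one reason a trained classifier or a hybrid of the two approaches may be worth considering.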