Data, Genia Corpus, i2b2 2010 challenge corpus, CRAFT corpus, i2b2 Corpora, PennBioIE, BioNLP-Corpora, BioInfer

BioInfer: Bio Information Extraction Resource

Biomedical Information Extraction Resource (BioInfer) is a public resource providing a manually annotated corpus and related resources for information extraction in the biomedical domain.

The corpus contains sentences from abstracts of biomedical research articles, annotated for relationships, named entities, and syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of its relationship annotation.

CRAFT: The Colorado Richly Annotated Full Text Corpus

The Colorado Richly Annotated Full Text Corpus (CRAFT) is a large annotated corpus consisting of full texts of biomedical journal articles. It includes both semantic and syntactic annotation layers (listed below) that have been carried out by experienced linguistic and domain-expert annotators. Various formats are available.

A sample of the corpus is available here:


The PennBioIE Oncology Corpus consists of 1414 PubMed abstracts on cancer, concentrating on molecular genetics, and comprises approximately 327,000 words of biomedical text, tokenized and annotated for paragraph, sentence, part of speech, and 24 types of biomedical named entities in five categories of interest. 318 of the abstracts have also been syntactically annotated.


Data sets from shared tasks. In order to enhance the ability of natural language processing (NLP) tools to prise increasingly fine-grained information from clinical records, i2b2 has previously provided sets of fully deidentified notes from the Research Patient Data Repository at Partners HealthCare for a series of NLP Challenges organized by Dr. Ozlem Uzuner. We are pleased to now make those notes available to the community for general research purposes. At this time we are releasing the notes (~1,000) from the first i2b2 Challenge as i2b2 NLP Research Data Set #1.


The GENIA corpus is a collection of biomedical literature, compiled and annotated within the scope of the GENIA project. The goal of the project is to develop text mining (TM) systems for the domain of molecular biology, and the corpus has been developed to provide reference material for the development of bio-TM systems. The corpus currently contains 1,999 Medline abstracts, which were collected using the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated with various levels of linguistic and semantic information.

On the unification of syntactic annotations under the Stanford dependency scheme: A case study on BioInfer and GENIA

Pyysalo, S., F. Ginter, K. Haverinen, J. Heimonen, T. Salakoski, and V. Laippala, "On the unification of syntactic annotations under the Stanford dependency scheme: A case study on BioInfer and GENIA", Proceedings of the Workshop on BioNLP 2007: Biological, translational, and clinical language processing: Association for Computational Linguistics, pp. 25–32, 2007.


BioNLP-Corpora is a repository of biologically and linguistically annotated corpora and biological datasets. It is one of the projects of the BioNLP initiative by the Center for Computational Pharmacology at the University of Colorado Denver Health Sciences Center to create and distribute code, software, and data for applying natural language processing techniques to biomedical texts.
