Published June 28, 2022 | Version 6.3.1
Dataset Open

LivingNER corpus: Named entity recognition, normalization & classification of species, pathogens and food

Description

LivingNER Gold Standard corpus (includes training, validation, test and background sets + MULTILINGUAL RESOURCES)

 

Please cite if you use this dataset:

A. Miranda-Escalada, E. Farré-Maduell, S. Lima-López, D. Estrada, L. Gascó, M. Krallinger, Mention detection, normalization & classification of species, pathogens, humans and food in clinical documents: Overview of LivingNER shared task and resources, Procesamiento del Lenguaje Natural (2022)

@article{amiranda2022nlp, title={Mention detection, normalization \& classification of species, pathogens, humans and food in clinical documents: Overview of LivingNER shared task and resources}, author={Miranda-Escalada, Antonio and Farr{\'e}-Maduell, Eul{\`a}lia and Lima-L{\'o}pez, Salvador and Estrada, Darryl and Gasc{\'o}, Luis and Krallinger, Martin}, journal={Procesamiento del Lenguaje Natural}, year={2022} }

 

1. Introduction

The LivingNER Gold Standard corpus is a collection of 2000 clinical case reports covering a broad range of medical specialities, namely infectious diseases (including Covid-19 cases), cardiology, neurology, oncology, dentistry, pediatrics, endocrinology, primary care, allergology, radiology, psychiatry, ophthalmology, urology, internal medicine, emergency and intensive care medicine, tropical medicine, and dermatology. The reports are annotated with mentions of species [SPECIES] (including living organisms and microorganisms) and infectious diseases [ENFERMEDAD]. Species mentions include many pathogens and infectious agents, but also food, allergens, pets or other species, taxonomic groups and organisms of clinical relevance.

The LivingNER corpus also includes annotations of mentions of humans (tag HUMAN), including the patients themselves, family members, healthcare professionals or other persons mentioned in the case reports. It can thus be useful for extracting patients' family history or information about their social environment and their interactions with healthcare personnel.

All mentions have been exhaustively manually mapped by experts to their corresponding NCBI Taxonomy identifiers. 

The corpus was used for the LivingNER Shared Task on pathogens and living beings detection and normalization in Spanish medical documents, which was held as part of IberLEF 2022.

 

2. Training, validation, test and background sets

The training set is composed of 1000 clinical case reports. The validation set includes 500 clinical case reports with the same characteristics, and the test set includes 485. The background set is a collection of around 13k unannotated case reports that were originally added to prevent participants from manually annotating the test set during the competition, and to create a Silver Standard.

2.1 Annotations format

Annotations and text files are distributed separately. The texts are in plain text format (.txt, UTF-8), while the annotations are distributed in a tab-separated (.tsv) file with one row per annotation:

- For subtask 1 (LivingNER-Species NER track), the .tsv file has the following columns:

  • filename: document name
  • mark: mention identifier
  • label: mention type (SPECIES or HUMAN)
  • off0: starting position of the mention in the document
  • off1: ending position of the mention in the document
  • span: textual span
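
As a minimal sketch of how these columns could be consumed, the Python snippet below reads a subtask 1 style TSV and checks each span against its document text. It assumes that the file has a header row with the column names listed above and that off0 and off1 are character offsets with off1 exclusive; neither is stated explicitly here, so verify both against the actual files. The function names and the sample data are illustrative and not part of the corpus.

```python
import csv
import io


def read_mentions(tsv_handle):
    """Read a subtask 1 style TSV (assumed to have a header row) into dicts."""
    reader = csv.DictReader(tsv_handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    mentions = []
    for row in reader:
        row["off0"] = int(row["off0"])
        row["off1"] = int(row["off1"])
        mentions.append(row)
    return mentions


def span_matches(text, mention):
    """Assumption: off0/off1 are character offsets and off1 is exclusive."""
    return text[mention["off0"]:mention["off1"]] == mention["span"]


# Illustrative data only; not taken from the corpus.
sample_text = "Paciente con infección por Escherichia coli."
sample_tsv = (
    "filename\tmark\tlabel\toff0\toff1\tspan\n"
    "doc1\tT1\tHUMAN\t0\t8\tPaciente\n"
    "doc1\tT2\tSPECIES\t27\t43\tEscherichia coli\n"
)

mentions = read_mentions(io.StringIO(sample_tsv))
mismatches = [m for m in mentions if not span_matches(sample_text, m)]
print(len(mentions), "mentions,", len(mismatches), "offset mismatches")
```

If the mismatch count is non-zero on real files, the offset convention (inclusive end, byte offsets, newline handling) is the first thing to check.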

- For subtask 2 (LivingNER-Species Norm track), the .tsv file has the same columns as the previous one, plus:

  • isH: whether the span is narrower than the assigned NCBITax code
  • isN: whether the mention corresponds to a nosocomial infection
  • iscomplex: whether the span is assigned a combination of NCBITax codes
  • NCBITax: mention code in the NCBI Taxonomy
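
The extra columns could be converted into Python values along the following lines. This is only a sketch: the description above does not say how the flags are encoded or how a combination of codes is written, so the boolean spellings and the "|" separator are assumptions exposed as parameters, and parse_norm_fields is a hypothetical helper.

```python
def parse_norm_fields(row, code_sep="|", true_values=("true", "yes", "1")):
    """Turn the extra subtask 2 columns of one TSV row into Python values.

    Assumptions (not confirmed by the dataset description): boolean flags
    are stored as text such as "True"/"False", and a complex mention lists
    several NCBITax codes joined by `code_sep`.
    """
    def as_bool(value):
        return value.strip().lower() in true_values

    codes = [c.strip() for c in row["NCBITax"].split(code_sep) if c.strip()]
    return {
        "isH": as_bool(row["isH"]),
        "isN": as_bool(row["isN"]),
        "iscomplex": as_bool(row["iscomplex"]),
        "codes": codes,
    }


# Hypothetical row; 9615 and 9685 are the NCBI Taxonomy IDs for dog and cat.
row = {"isH": "False", "isN": "False", "iscomplex": "True", "NCBITax": "9615|9685"}
parsed = parse_norm_fields(row)
print(parsed["iscomplex"], parsed["codes"])
```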

- For subtask 3 (LivingNER-Clinical IMPACT track), the .tsv file has the following columns:

  • filename
  • isPet (Yes/No)
  • PetIDs (NCBITaxonomy codes of pet & farm animals present in document)
  • isAnimalInjury (Yes/No)
  • AnimalInjuryIDs (NCBITaxonomy codes of animals causing injuries present in document)
  • IsFood (Yes/No)
  • FoodIDs (NCBITaxonomy codes of food mentions present in document)
  • isNosocomial (Yes/No)
  • NosocomialIDs (NCBITaxonomy codes of nosocomial species mentions present in document)

2.2 Important notes about subtask 3 (LivingNER-Clinical IMPACT track):

  • Fewer clinical case reports. Subtask 3 (LivingNER-Clinical IMPACT track) contains half of the clinical case reports (500 in the training partition, 250 in the validation partition). The list of valid clinical case reports for subtask 3 is included in the data (train_files_task3.txt and validation_files_task3.txt).
  • Enriched dataset. The Gold Standard (GS) format is the one described above (a TSV with one line per clinical case report). However, we believe participants may find an enriched dataset useful. We therefore provide an additional dataset, with the mentions of the NER track classified into the 4 Clinical IMPACT categories (food, pet & farm animals, animals causing injuries and nosocomial). It is a TSV file with one row per annotation, and with the following columns: filename, mark, label, off0, off1, span, isPet, isAnimalInjury, isFood, isNosocomial, isH, iscomplex, code
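
To illustrate how the two subtask 3 formats relate, the sketch below collapses mention-level rows of the enriched dataset into one record per document, mirroring the document-level columns. It assumes the category flags are stored as "Yes"/"No" and that each row carries a single value in the code column; both are assumptions to check against the files. aggregate_by_document and the sample rows are hypothetical.

```python
from collections import defaultdict

# Mention-level flag column -> document-level ID column.
CATEGORIES = {
    "isPet": "PetIDs",
    "isAnimalInjury": "AnimalInjuryIDs",
    "isFood": "FoodIDs",
    "isNosocomial": "NosocomialIDs",
}


def aggregate_by_document(mention_rows, yes="Yes"):
    """Collapse mention-level rows into one record per document.

    Assumption: flags are the strings "Yes"/"No" and `code` holds one code.
    """
    ids = defaultdict(lambda: {flag: set() for flag in CATEGORIES})
    for row in mention_rows:
        doc = ids[row["filename"]]
        for flag in CATEGORIES:
            if row[flag] == yes:
                doc[flag].add(row["code"])
    records = {}
    for filename, per_flag in ids.items():
        record = {}
        for flag, id_column in CATEGORIES.items():
            record[flag] = "Yes" if per_flag[flag] else "No"
            record[id_column] = sorted(per_flag[flag])
        records[filename] = record
    return records


# Illustrative rows only; 9615 is the NCBI Taxonomy ID for dog.
rows = [
    {"filename": "doc1", "code": "9615", "isPet": "Yes",
     "isAnimalInjury": "Yes", "isFood": "No", "isNosocomial": "No"},
    {"filename": "doc1", "code": "9606", "isPet": "No",
     "isAnimalInjury": "No", "isFood": "No", "isNosocomial": "No"},
]
records = aggregate_by_document(rows)
print(records["doc1"]["isPet"], records["doc1"]["PetIDs"])
```

Sets are used so that a species mentioned several times in a document contributes its code only once.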

 

3. Multilingual resources

We have generated the annotated training and validation sets in 7 languages:

  • English
  • Portuguese
  • Catalan
  • Galician
  • Italian
  • French
  • Romanian

 

The process was:

  1. The text files were translated with a neural machine translation system.
  2. The annotations were translated with the same neural machine translation system.
  3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
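
The annotation transfer technology is not specified here. Purely to illustrate what step 3 has to accomplish, the sketch below locates a translated annotation in the translated text with a naive exact string match. This is a stand-in technique, not the method used to build these resources, and transfer_annotation and the example sentence are hypothetical.

```python
def transfer_annotation(translated_text, translated_span, start_hint=0):
    """Locate a translated mention in the translated text by exact match.

    Returns (off0, off1) with off1 exclusive, or None when the span does not
    occur verbatim (e.g. the translation inflected or reordered the words).
    """
    off0 = translated_text.find(translated_span, start_hint)
    if off0 == -1:
        return None
    return off0, off0 + len(translated_span)


# Illustrative sentence only; not taken from the corpus.
text_en = "Patient with an infection caused by Escherichia coli."
print(transfer_annotation(text_en, "Escherichia coli"))
print(transfer_annotation(text_en, "dog"))
```

Exact matching fails whenever the annotation and the sentence are translated differently, which is one reason a dedicated transfer step is needed in practice.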

The text files are stored in the multilingual_resources/training-text-files and multilingual_resources/validation-text-files subfolders.

The annotated TSV files are stored in the multilingual_resources/annotation_transfer subfolder.

For the sake of comparison, we also include the annotations produced by the LINNAEUS tool in the multilingual_resources/linneaus subfolder.

If you want to visualize the multilingual resources, check out this Brat server: https://temu.bsc.es/mLivingNER/#/translations/

For instance, you can see the parallel annotations in English vs in French, or in Spanish (the gold standard) vs in Catalan.

 


License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contact

If you have any questions or suggestions, please contact us at:


- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

Additional resources and corpora

If you are interested in LivingNER, you might want to check out these corpora and resources:

  • DisTEMIST (Corpus of disease mentions and normalization to SNOMED CT, different document collection, some overlapping documents)
  • SympTEMIST (Corpus of symptoms, signs and findings mentions and normalization to SNOMED CT, different document collection, some overlapping documents)
  • MedProcNER (Corpus of clinical procedure mentions and normalization to SNOMED CT, different document collection, some overlapping documents)
  • PharmaCoNER (Corpus of medications, drugs, chemical substances, genes, proteins and vaccine mentions and normalization, different document collection, some overlapping documents)
  • MEDDOPROF (Corpus of mentions of professions, occupations and working status and normalization, different document collection)
  • MEDDOPLACE (Corpus of place-related entity mentions, including departments, nationalities or patient movements etc., and normalization, different document collection)
  • MEDDOCAN (Corpus of mentions of Personal Health Identifiers (PHI), different document collection)
  • CANTEMIST (Corpus of cancer tumor morphology mentions and normalization, different document collection)
  • CodiESp (Corpus of clinical case reports with assigned clinical codes from ICD10, Spanish version, different document collection, some overlapping documents)
  • SPACCC-POS (Corpus of clinical case reports in Spanish annotated with POS-tags, different document collection, some overlapping documents)
  • SPACCC-TOKEN (Corpus of clinical case reports in Spanish annotated with token-tags (word mention boundaries), different document collection, some overlapping documents)
  • SPACCC-SPLIT (Corpus of clinical case reports in Spanish annotated with sentence boundary-tags, different document collection, some overlapping documents)
  • MESINESP-2 (Corpus of manually indexed records with DeCS/MeSH terms comprising scientific literature abstracts, different document collection, some overlapping documents)

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

livingner-bundle_training_valid_test_background_multilingual.zip

Files (59.0 MB)