Published April 27, 2022 | Version v4
Dataset Open

ClinSpEn Corpus: Parallel English-Spanish COVID-19 Clinical Cases, Terminology and Ontology Concepts

  • 1. Barcelona Supercomputing Center

Description

ClinSpEn Parallel Corpus Collection

This repository contains the complete ClinSpEn corpus collection, which was used for the ClinSpEn shared task at Biomedical WMT 2022.

ClinSpEn is a collection of Gold Standard EN-ES parallel corpora covering different types of clinical data: case reports, medical controlled vocabularies/ontologies, and clinical terms and entities extracted from medical content. It includes development and test data translated by professional medical translators, which can be used to train and benchmark clinical EN-ES machine translation systems. Additionally, monolingual background data is provided so that system performance can be analyzed on unseen data.

If you use this dataset, please cite:

@inproceedings{biowmt22,
  title={Findings of the WMT 2022 Biomedical Translation Shared Task: Monolingual Clinical Case Reports},
  author={Neves, Mariana and Yepes, Antonio Jimeno and Siu, Amy and Roller, Roland and Thomas, Philippe and Navarro, Maika Vicente and Yeganova, Lana and Wiemann, Dina and Di Nunzio, Giorgio Maria and Vezzani, Federica and others},
  booktitle={WMT22-Seventh Conference on Machine Translation},
  pages={694--723},
  year={2022}
}

Data Description

ClinSpEn proposes three different sub-tracks, each based on a different type of clinical data:

1. Clinical Cases:

Parallel EN-ES COVID-19 clinical case reports. The direction of this sub-track is EN>ES.

The dataset’s case reports were carefully selected to cover a wide range of aspects of the disease: different types of patients (babies, children, adults, elderly and pregnant people), different comorbidities (cancer, mental health issues, immunosuppression) and varied symptomatology (mild and severe presentations; dermatologic, immunologic and psychiatric manifestations; thrombosis; among others). The reports were first translated from English to Spanish by a professional medical translator and then revised by a clinical expert.

The sample (dev) set and the test set consist of parallel txt files (50 and 152 documents, respectively). Spanish files have a “.es” extension and English files a “.en” extension. Each report is aligned at the sentence level, so that a given line number holds the same sentence in both languages.
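Because the two versions of a report share line numbers, a sentence-aligned view can be built by reading both files in lockstep. The following is a minimal sketch, not part of the corpus distribution: the function name `read_parallel` is hypothetical, and UTF-8 encoding is an assumption.

```python
def read_lines(path):
    """Read a text file and return its lines without trailing newlines."""
    # Assumption: the corpus files are UTF-8 encoded.
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(en_path, es_path):
    """Return a list of (english, spanish) sentence pairs for one report.

    Relies on the line-by-line alignment described above: line N of the
    ".en" file and line N of the ".es" file hold the same sentence.
    """
    en_lines = read_lines(en_path)
    es_lines = read_lines(es_path)
    if len(en_lines) != len(es_lines):
        # A mismatch means the pair cannot be aligned by line number.
        raise ValueError(
            f"line count mismatch: {len(en_lines)} (en) vs {len(es_lines)} (es)"
        )
    return list(zip(en_lines, es_lines))
```

Calling `read_parallel` on the “.en” and “.es” files of the same report yields one tuple per sentence, which can be passed directly to most MT evaluation or fine-tuning pipelines.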

The background data (9,804 files) is provided as a TSV file with four columns: filename, document number, line number and English line. The clinical cases themselves include COVID-19 case reports as well as diverse content extracted from PubMed.
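Since the background TSV stores one line of text per row, the original documents can be rebuilt by grouping rows by filename and ordering them by line number. The sketch below assumes the column order given above, UTF-8 encoding, and no header row unless `has_header=True` is passed; the function name `load_background` is hypothetical.

```python
import csv
from collections import defaultdict


def load_background(tsv_path, has_header=False):
    """Group background rows into documents.

    Returns a dict mapping each filename to its English lines,
    ordered by the line number column.
    """
    docs = defaultdict(list)
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks in clinical text as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)  # Assumption: at most one header row.
        for row in reader:
            if len(row) != 4:
                continue  # Skip blank or malformed rows.
            filename, _doc_number, line_number, text = row
            docs[filename].append((int(line_number), text))
    return {
        name: [text for _, text in sorted(lines)]
        for name, lines in docs.items()
    }
```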

If you need to map the entries in the joint test + background document provided in earlier versions, you may use the "clinspen_clinicalcases_test-set_filename_mapping.tsv" file.

2. Clinical Terminology:

Parallel EN-ES clinical terms extracted from medical literature and clinical records, with a particular focus on diseases, symptoms, findings, procedures and professions. The terms were translated and revised by professional medical translators. The direction of this sub-track is ES>EN.

The sample (dev) set contains 7,000 terms as a tab-separated file (TSV), with the first column corresponding to English terms and the second column to Spanish terms.

The test data (12,128 terms) is made up of a TSV file with three columns: term number, English term and Spanish term.

The background data (201,890 terms) is made up of a TSV file with two columns: term number and Spanish term.

The term number columns can be used to map the entries in the joint test + background document provided in earlier versions.
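The three terminology files use slightly different column layouts, so it can help to name the columns explicitly when loading them. The following is a sketch under stated assumptions: the files are UTF-8 and have no header row, and the column names and helper functions (`load_terms`, `index_by_term_number`) are illustrative labels, not names used in the corpus.

```python
import csv

# Column layouts as described above; the names themselves are illustrative.
DEV_COLUMNS = ("english", "spanish")
TEST_COLUMNS = ("term_number", "english", "spanish")
BACKGROUND_COLUMNS = ("term_number", "spanish")


def load_terms(tsv_path, columns):
    """Return one dict per row of a terminology TSV, keyed by column name."""
    rows = []
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # Skip blank lines.
            if len(row) != len(columns):
                raise ValueError(
                    f"{tsv_path}, line {line_no}: expected "
                    f"{len(columns)} columns, found {len(row)}"
                )
            rows.append(dict(zip(columns, row)))
    return rows


def index_by_term_number(rows):
    """Build a lookup from term number to row, for files that carry that column."""
    return {row["term_number"]: row for row in rows}
```

Indexing the test rows by term number gives a direct lookup for any file that refers to terms by that number.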

3. Ontology Concepts:

Parallel EN-ES concepts extracted from various open biomedical ontologies and taxonomies and then manually translated by a professional medical translator. The direction of this sub-track is EN>ES.

The sample (dev) data includes 400 concepts, presented as a tab-separated file (TSV). The first column contains the English term and the second the Spanish term. The third column gives the term’s source ontology and its corresponding ID (separated by an underscore), and the fourth provides a link to the concept in the OBO Library.
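The ontology column combines two pieces of information, so separating them makes it easier to group concepts by source ontology. The helper below is a hypothetical sketch: it splits on the last underscore, on the assumption that the local ID contains no underscore, and the example values in the usage note are illustrative rather than taken from the corpus.

```python
def split_ontology_id(value):
    """Split an 'ONTOLOGY_ID' string into (ontology, local_id).

    Assumption: the local ID contains no underscore, so the split is made
    at the last underscore in the string.
    """
    ontology, separator, local_id = value.rpartition("_")
    if not separator or not ontology or not local_id:
        raise ValueError(f"not in ONTOLOGY_ID form: {value!r}")
    return ontology, local_id
```

For instance, a value of the form `HP_0001250` would be split into `("HP", "0001250")`.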

The test data (1,789 concepts) is made up of a TSV file with five columns: term number, English term, Spanish term, ontology id and OBO library URL.

The background data (299,408 concepts) is made up of a TSV file with four columns: term number, English term, ontology id and OBO library URL.

The term number columns can be used to map the entries in the joint test + background document provided in earlier versions.

 

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Contact

If you have any questions or suggestions, please contact us at the following addresses:

- Salvador Lima-López (<salvador [dot] limalopez [at] gmail [dot] com>)
- Darryl Estrada (<darrylestrada97 [at] gmail [dot] com>)
- Martin Krallinger (<krallinger [dot] martin [at] gmail [dot] com>)

 

Files

clinspen_corpora_complete.zip (19.4 MB)
md5:c3ad63892f70aadd7c939eef2044d1c2
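
After downloading, the archive can be checked against the MD5 checksum listed for clinspen_corpora_complete.zip. This is a minimal sketch using Python's standard `hashlib`; the function name `md5_of_file` is hypothetical, and the archive is assumed to sit in the current working directory under its original name.

```python
import hashlib

# Checksum taken from the file listing above.
EXPECTED_MD5 = "c3ad63892f70aadd7c939eef2044d1c2"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `md5_of_file("clinspen_corpora_complete.zip")` with `EXPECTED_MD5` indicates whether the download is intact.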