SocialDisNER corpus: gold standard annotations for detection of disease mentions in Spanish tweets

doi:10.5281/zenodo.6803567

Published March 15, 2022 | Version 1.4.0

Dataset Open

1. Barcelona Supercomputing Center

If you use any data from this repository, please cite our scientific paper instead of the Zenodo repo:

Luis Gasco Sánchez, Darryl Estrada Zavala, Eulàlia Farré-Maduell, Salvador Lima-López, Antonio Miranda-Escalada, and Martin Krallinger. 2022. The SocialDisNER shared task on detection of disease mentions in health-relevant content from social media: methods, evaluation, guidelines and corpora. In Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task, pages 182–189, Gyeongju, Republic of Korea. Association for Computational Linguistics.

@inproceedings{gasco2022socialdisner,
  title = "The {S}ocial{D}is{NER} shared task on detection of disease mentions in health-relevant content from social media: methods, evaluation, guidelines and corpora",
    author = "Gasco S{\'a}nchez, Luis  and
      Estrada Zavala, Darryl  and
      Farr{\'e}-Maduell, Eul{\`a}lia  and
      Lima-L{\'o}pez, Salvador  and
      Miranda-Escalada, Antonio  and
      Krallinger, Martin",
    booktitle = "Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop {\&} Shared Task",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.smm4h-1.48",
    pages = "182--189"
}

Introduction:
The SocialDisNER corpus, released for Task 10 of SMM4H 2022, focuses on the recognition of disease mentions in tweets written in Spanish. Tweets were selected to capture primarily first-hand experiences of diseases and other health-relevant content (from patient associations, professional healthcare institutions, and followers of patient association accounts covering a diversity of pathologies, including rare diseases, mental health, and cancer).

SocialDisNER Gold Standard

The Gold Standard corpus was manually annotated by medical experts following the SMM4H-SocialDisNER guidelines. These guidelines were adapted from previous efforts used to annotate patient clinical records and medical literature. They cover rules for annotating mentions of diseases in health-related tweets in Spanish.

The training set consists of 5000 tweets written in Spanish and the validation set consists of 2500 tweets written in Spanish. Both sets have been manually annotated by healthcare professionals. The test dataset contains 23430 tweets, although only 2000 will be used to evaluate the systems participating in the task (the rest form the background set). We don't plan to publish the test set, but you can test your system on the SocialDisNER Codalab.

SocialDisNER Large Scale Corpus

The large-scale data contains mentions automatically extracted from a set of 85000 tweets. Separate datasets are provided for each entity type: diseases, drugs, symptoms, professions, procedures, species, morphology neoplasm, and persons.

SocialDisNER co-mention networks

We have computed a co-occurrence matrix of the extracted diseases, as well as several co-mention matrices between the disease mentions and the rest of the entities in the large-scale corpora.

File structure:

The structure of the corpus is:

SocialDisNER_Data:
- training-validation-data folder
  - train-valid-txt-files: folder with training and validation text files. There is one text file per tweet, and the file name corresponds to the tweet id. There is one sub-directory per corpus split (train and valid). The files named ids_dev_set.txt and ids_train_set.txt contain the list of file identifiers for the validation and train splits, respectively.
  - mentions.tsv: This file contains the manually annotated disease mentions. The file has the following fields:
    - tweets_id: The id of the tweet. The content of the tweet can be retrieved with this id through the Twitter API.
    - Begin: The character position in the tweet where the annotation starts.
    - End: The position of the last character of the annotation in the tweet.
    - Type: The type of entity found, in our case "ENFERMEDAD".
    - Extraction: The literal extraction, in other words, the fragment of text to which the annotation refers.
- test-data folder:
  - test-data-txt-files: folder with test text files. One file per tweet, the file name corresponds to the tweet id. The folder contains 23430 tweets to be used as test set of the task. Of them, 2000 will be used to evaluate the participating systems.
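
The field descriptions above do not say unambiguously whether End is inclusive or exclusive, so it can be worth checking the offsets against the tweet text before training on them. The following is a minimal sketch of such a check, not an official loader. It assumes that mentions.tsv has a header row with the field names listed above, that each tweet is stored as `<tweets_id>.txt`, and that `txt_dir` points at one split sub-directory; the function names are hypothetical.

```python
import csv
from pathlib import Path


def read_mentions(tsv_path):
    """Read a tab-separated mentions file into a list of dicts keyed by its header row."""
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def span_matches(text, begin, end, extraction):
    """Return which offset convention, if any, reproduces the extraction.

    "exclusive" means text[begin:end]; "inclusive" means text[begin:end + 1].
    """
    if text[begin:end] == extraction:
        return "exclusive"
    if text[begin:end + 1] == extraction:
        return "inclusive"
    return None


def check_mentions(mentions, txt_dir):
    """Count how many mentions match their tweet text under each convention."""
    counts = {"exclusive": 0, "inclusive": 0, None: 0}
    for row in mentions:
        # Assumption: column names follow the field list; compare them case-insensitively.
        row = {key.lower(): value for key, value in row.items()}
        # Assumption: one file per tweet named <tweets_id>.txt inside txt_dir.
        text = (Path(txt_dir) / f"{row['tweets_id']}.txt").read_text(encoding="utf-8")
        result = span_matches(text, int(row["begin"]), int(row["end"]), row["extraction"])
        counts[result] += 1
    return counts
```

If nearly all mentions fall under one convention, that convention can be used when slicing spans; a large `None` count would suggest that the path or encoding assumptions do not hold for the downloaded files.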

SocialDisNER_LargeScale_additionaldata:
- socialdisner_diseases:
  - tweets_txt: Folder with large-scale tweet database. One text file per tweet, the file name corresponds to the tweet id.
  - diseases_mentions.tsv: This file contains the automatically annotated disease mentions from the large-scale SocialDisNER corpus (Silver Standard). The structure is the same as that of the Gold Standard annotations.
- socialdisner_ENTITY: Each folder with this naming convention contains the following data structure. Corpora have been generated with mentions of diseases, drugs, symptoms, professions, procedures, species, morphology neoplasm and persons.
  - tweets_txt: Folder with large-scale tweet database. One text file per tweet, the file name corresponds to the tweet id.
  - ENTITY_mentions.tsv: This file contains the automatically annotated mentions of type “ENTITY” from the large-scale SocialDisNER corpus (Silver Standard). The structure is the same as that of the Gold Standard annotations.
- socialdisner_networks: This folder contains tsv files containing the co-mention matrices between the diseases and the rest of the entities of the large-scale socialdisner data. Each file follows the following naming convention:
  - socialdisner_disease-ENTITY_net.tsv: The tsv file contains a series of columns and rows corresponding to the mentions used for building the matrix. Each column is separated by “;”. The type of each mention is identified by the label in parentheses of each title. The count represents the number of times that mention x and mention y were found in the same tweet of the large-scale dataset.
  - socialdiser_disease_net.tsv: This tsv file contains the array of socialdisner-disease large-scale corpus co-mentions separated by ";". This file can be loaded into NetworkX to perform disease co-morbidity analysis on the socialdisner-disease large-scale data.
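
As a rough illustration of how a ";"-separated co-mention matrix could be turned into weighted edges, here is a minimal sketch using only the standard library. It assumes that the first row holds the column labels, that the first cell of each following row holds the row label, and that empty cells mean zero; the function name and the toy matrix are made up for the example and are not taken from the corpus.

```python
import csv
import io


def read_comention_edges(handle, delimiter=";"):
    """Turn a labelled count matrix into (row_label, column_label, count) triples."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    column_labels = header[1:]  # assumption: the first header cell is a corner cell
    edges = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        row_label, cells = row[0], row[1:]
        for column_label, cell in zip(column_labels, cells):
            count = float(cell) if cell.strip() else 0.0
            if count > 0:
                edges.append((row_label, column_label, count))
    return edges


# Toy matrix in the assumed layout; the labels and counts are invented.
sample = io.StringIO(
    ";diabetes (ENFERMEDAD);asma (ENFERMEDAD)\n"
    "diabetes (ENFERMEDAD);0;3\n"
    "asma (ENFERMEDAD);3;0\n"
)
edges = read_comention_edges(sample)
print(edges)
```

If NetworkX is installed, triples of this shape can be passed to `Graph.add_weighted_edges_from` to build the co-mention graph. For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")`.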

Note: In previous versions of the dataset, the columns of the mentions.tsv file were not in the correct order. From this version onwards, the column order is correct and matches the format required for submitting predictions to the task.
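
Since the note above ties the column order of mentions.tsv to the prediction format, a prediction file can be sketched as a TSV with the same columns in the same order. The snippet below is only an illustration under that assumption: the header spelling and capitalization are copied from the field list in this description and should be checked against the distributed mentions.tsv, and the function name and example row are hypothetical.

```python
# Assumption: same column order as the field list for mentions.tsv.
COLUMNS = ["tweets_id", "Begin", "End", "Type", "Extraction"]


def format_predictions(rows):
    """Serialize prediction dicts as tab-separated text with a header row."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        fields = [str(row[name]) for name in COLUMNS]
        # Tabs or newlines inside a field would break the TSV layout.
        if any("\t" in field or "\n" in field for field in fields):
            raise ValueError(f"field contains a tab or newline: {fields!r}")
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Hypothetical prediction; the tweet id and offsets are invented.
example = [
    {"tweets_id": "123", "Begin": 6, "End": 14, "Type": "ENFERMEDAD", "Extraction": "diabetes"}
]
print(format_predictions(example), end="")
```

Building the text by hand rather than through a CSV writer keeps quote characters in tweets untouched, at the cost of rejecting fields that contain tabs or newlines.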

For further information, please visit https://temu.bsc.es/socialdisner/

Summary statistics:

Manually annotated data
	Training set	Development set
# tweets	5000	2500
# characters	1253431	516768
# tokens	211555	84478
Avg. char / tweet	250.69	206.71
Avg. tok. / tweet	42.31	33.79
# mentions	15173	4252
# unique mentions	4407	1413
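
The per-tweet averages in the table above follow directly from the totals. The short sketch below recomputes them as a sanity check and also derives mentions per tweet, a figure that is not part of the table; the dictionary layout and function name are just for illustration.

```python
# Totals copied from the "Manually annotated data" table.
STATS = {
    "training": {"tweets": 5000, "characters": 1253431, "tokens": 211555, "mentions": 15173},
    "development": {"tweets": 2500, "characters": 516768, "tokens": 84478, "mentions": 4252},
}


def per_tweet(split, field):
    """Average of a total over the number of tweets, rounded to two decimals."""
    return round(STATS[split][field] / STATS[split]["tweets"], 2)


for split in STATS:
    print(
        split,
        per_tweet(split, "characters"),
        per_tweet(split, "tokens"),
        per_tweet(split, "mentions"),
    )
```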

Large-scale annotated data (Silver Standard)
	Socialdisner-diseases	Socialdisner-pharma	Socialdisner-morphology_neoplasms	Socialdisner-symptoms	Socialdisner-professions	Socialdisner-Procedures	Socialdisner-Person	Socialdisner-Species
# tweets	85077	1759	8518	12624	15831	11462	41033	12118
# characters	19920670	435141	2082574	3023784	4063114	2873791	10273278	2933925
# tokens	3236411	68269	332539	521503	660071	467059	1689479	486249
Avg. char / tweet	234.15	247.38	244.49	239.53	256.66	250.72	250.37	242.11
Avg. tok. / tweet	38.04	38.81	39.04	41.31	41.69	40.75	41.17	40.13
# mentions	116260	1029	8943	12896	18590	10080	58007	14014
# unique mentions	16034	530	541	6991	3667	3841	3446	1676

Do not share the data with other individuals or teams without permission from the task organizer. Tweet IDs are the primary source of information. Tweet texts are provided as support material. By downloading this resource, you agree to the Twitter Terms of Service, Privacy Policy, Developer Agreement, and Developer Policy.

Notes

Funded by the Plan de Impulso de las Tecnologías del Lenguaje (Plan TL).

Files

Files (97.8 MB)

Name	MD5	Size
SocialDisNER_Data.zip	0d1e1e2cba740265cb13ef66f045723b	12.0 MB
SocialDisNER_LargeScale_additionaldata.zip	251b9fa20e91932091fb95520b64dacc	85.8 MB
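
The MD5 checksums listed with the files can be used to confirm that a download is complete. A minimal sketch with Python's standard `hashlib` module follows; it assumes the archives keep their original file names, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Checksums as listed for the two archives.
EXPECTED_MD5 = {
    "SocialDisNER_Data.zip": "0d1e1e2cba740265cb13ef66f045723b",
    "SocialDisNER_LargeScale_additionaldata.zip": "251b9fa20e91932091fb95520b64dacc",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_checksum(path):
    """Compare a downloaded archive against the checksum listed for its file name."""
    expected = EXPECTED_MD5[Path(path).name]
    return md5_of_file(path) == expected
```

For example, `matches_listed_checksum("SocialDisNER_Data.zip")` should return `True` for an intact download in the current directory.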
