How to use the data#
Start by cloning the repository:
git clone https://github.com/neurodatascience/labelbuddy-annotations.git
cd labelbuddy-annotations
The documents, labels and annotations are stored in JSON or JSONLines files in the projects/ directory.
Building the database#
We can easily create a SQLite database containing all the information in the repository (if you prefer not to use SQL, you can work with a CSV file instead; see below):
make database
Or equivalently, for example if make is not available:
python3 ./scripts/make_database.py
We can then use this database to query the contents of the repository, either with the sqlite3 interactive command:
sqlite3 analysis/data/database.sqlite3
or using SQLite bindings that are available in many languages, including Python’s standard library module sqlite3.
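For instance, with the standard library bindings — a minimal sketch using a toy in-memory database; the simplified annotation columns here are illustrative, not the repository's actual schema:

```python
import sqlite3

# Toy in-memory database standing in for analysis/data/database.sqlite3,
# which exists after running `make database`.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE annotation (doc_id INTEGER, label_id INTEGER, selected_text TEXT)"
)
connection.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?)",
    [(1, 1, "schizophrenia"), (2, 1, "autism spectrum disorders")],
)

# The same query pattern works against the real database file:
# sqlite3.connect("analysis/data/database.sqlite3")
rows = connection.execute("SELECT selected_text FROM annotation").fetchall()
print(rows)
```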
A small Python package containing a few utilities for working with this repository is also provided in analysis/labelrepo. You can install it with
pip install -e "analysis/labelrepo"
(note the -e, or --editable, option).
The main tables are annotation, document and label. detailed_annotation is a view gathering detailed information about each annotation, such as the selected_text, a snippet of surrounding text (context), the label_name and annotator_name, etc.
For example, to display a few annotations:
from labelrepo import database, displays
connection = database.get_database_connection()
annotations = connection.execute("SELECT * FROM detailed_annotation limit 5")
displays.AnnotationsDisplay(annotations)
As another example (this time collecting results in a Pandas DataFrame), selecting all snippets of text that have been annotated with “Diagnosis”:
import pandas as pd
pd.read_sql(
"""
SELECT selected_text, COUNT(*) as occurrences
FROM detailed_annotation
WHERE label_name = 'Diagnosis'
GROUP BY selected_text
ORDER BY occurrences DESC
""",
connection,
)
|  | selected_text | occurrences |
|---|---|---|
| 0 | schizophrenia | 1 |
| 1 | autism spectrum disorders | 1 |
| 2 | Mild traumatic brain injury (mTBI) | 1 |
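The same aggregation can also be done purely in pandas, without SQL; a sketch with toy data standing in for the detailed_annotation table:

```python
import pandas as pd

# Toy data with the same column names as detailed_annotation.
annotations = pd.DataFrame(
    {
        "label_name": ["Diagnosis", "Diagnosis", "Diagnosis"],
        "selected_text": ["schizophrenia", "schizophrenia", "mTBI"],
    }
)

# Count occurrences of each snippet annotated with "Diagnosis",
# most frequent first (value_counts sorts in descending order).
counts = (
    annotations.loc[annotations["label_name"] == "Diagnosis", "selected_text"]
    .value_counts()
    .rename_axis("selected_text")
    .reset_index(name="occurrences")
)
print(counts)
```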
Using a CSV rather than a database#
If you prefer working with CSV files rather than SQL, you can also run (at the root of the repository)
make csv
That will create a file analysis/data/detailed_annotation.csv containing the detailed annotations table:
from labelrepo import repo
csv_file = repo.data_dir() / "detailed_annotation.csv"
annotations = pd.read_csv(csv_file, nrows=3)
displays.AnnotationsDisplay(annotations)
Here is all the information in that table for the first annotation:
try:
display = (
annotations.iloc[:1]
.stack()
.reset_index()
.style.hide("level_0", axis=1)
.hide(axis="index")
.hide(axis="columns")
)
except AttributeError:
# old pandas version
display = annotations.iloc[0]
display
|  |  |
|---|---|
| pmcid | 8334182 |
| pmid | 34368270 |
| publication_year | 2021 |
| journal | Front Vet Sci |
| title | Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians |
| label_name | entity |
| label_color | #729fcf |
| annotator_name | Kendra_Oudyk |
| start_char | 229 |
| end_char | 264 |
| project_name | NER_biomedical |
| selected_text | Rapid Residue Detection Assay Data |
| context | hian Dinani, Soudabeh and Millagaha Gedara, Nuwan Indika and Xu, Xuan and Richards, Emily and Maunsell, Fiona and Zad, Nader and Tell, Lisa A. Front Vet Sci, 2021 # Title Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians # Keywords MRL and tolerance commercial rapid assay test machine learning large scale data mining table extract |
| context_start_char | 29 |
| context_end_char | 464 |
| doc_length | 34250 |
| doc_md5 | 813c6fa5fd4080082d6a66ba4a2fee44 |
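The start_char and end_char offsets index directly into the document text; a toy illustration of how an annotation's offsets recover its selected_text (the text and offsets below are made up):

```python
# Made-up document text and annotation, for illustration only.
# In the real data, offsets index into the full document (doc_length characters).
text = "Patients with schizophrenia were scanned."
annotation = {"start_char": 14, "end_char": 27, "label_name": "Diagnosis"}

# Slicing the text with the offsets yields the annotated snippet.
selected = text[annotation["start_char"]:annotation["end_char"]]
print(selected)
```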
Using the JSON and JSONLines files directly#
.jsonl (JSONLines) files contain one JSON dictionary per line. They can easily be parsed, for example with Python's standard json module. Moreover, the labelrepo package contains a convenience function for parsing JSON or JSONLines files:
from labelrepo import read_json
annotations_file = (
repo.repo_root()
/ "projects"
/ "participant_demographics"
/ "annotations"
/ "Jerome_Dockes.jsonl"
)
for row in read_json(annotations_file)[:3]:
print(row)
{'annotations': [{'end_byte': 9, 'end_char': 8, 'label_name': 'discard', 'start_byte': 0, 'start_char': 0}], 'metadata': {'chapter': 1, 'doi': '10.3389/fnsys.2015.00126', 'field_positions': {'abstract': [289, 1183], 'authors': [0, 59], 'body': [1192, 33477], 'journal': [60, 79], 'keywords': [167, 276], 'publication_year': [81, 85], 'title': [96, 153]}, 'page': 1, 'part': 1, 'pmcid': 4565057, 'pmid': 26441558, 'text_md5': 'aeacc3bc705b025b4f7aecea35058ca0'}, 'utf8_text_md5_checksum': 'aeacc3bc705b025b4f7aecea35058ca0'}
{'annotations': [{'end_byte': 9458, 'end_char': 9450, 'label_name': 'healthy', 'start_byte': 9447, 'start_char': 9439}, {'end_byte': 9458, 'end_char': 9450, 'extra_data': '29', 'label_name': 'count', 'start_byte': 9447, 'start_char': 9439}], 'metadata': {'chapter': 1, 'doi': '10.3390/brainsci10090603', 'field_positions': {'abstract': [334, 1735], 'authors': [0, 69], 'body': [1744, 27005], 'journal': [70, 79], 'keywords': [206, 321], 'publication_year': [81, 85], 'title': [96, 192]}, 'page': 2, 'part': 1, 'pmcid': 7563756, 'pmid': 32887487, 'text_md5': '57be0e414d83cf88df7d14070e3ea9dd'}, 'utf8_text_md5_checksum': '57be0e414d83cf88df7d14070e3ea9dd'}
{'annotations': [{'end_byte': 7985, 'end_char': 7967, 'label_name': 'count', 'start_byte': 7983, 'start_char': 7965}, {'end_byte': 309, 'end_char': 308, 'label_name': 'diagnosis', 'start_byte': 280, 'start_char': 279}], 'metadata': {'chapter': 1, 'doi': '10.1016/j.nicl.2016.07.006', 'field_positions': {'abstract': [448, 2844], 'authors': [0, 159], 'body': [2853, 39646], 'journal': [160, 175], 'keywords': [322, 435], 'publication_year': [177, 181], 'title': [192, 308]}, 'page': 3, 'part': 1, 'pmcid': 5030332, 'pmid': 27672554, 'text_md5': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}, 'utf8_text_md5_checksum': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}
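These files can also be read with only the standard library; a minimal sketch that writes and then parses a tiny JSONLines file (the file contents here are made up):

```python
import json
import tempfile
from pathlib import Path

# Write a tiny JSONLines file, then parse it one line at a time with the
# standard json module -- the same approach works on the repository's
# annotations files.
with tempfile.TemporaryDirectory() as tmp:
    jsonl_file = Path(tmp) / "annotations.jsonl"
    jsonl_file.write_text(
        '{"annotations": [], "metadata": {"pmcid": 123}}\n'
        '{"annotations": [], "metadata": {"pmcid": 456}}\n'
    )
    rows = [json.loads(line) for line in jsonl_file.read_text().splitlines()]

print(rows[0]["metadata"]["pmcid"])
```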
Loading labels from a JSON file:
labels_file = (
repo.repo_root()
/ "projects"
/ "autism_mri"
/ "labels"
/ "Article_Terms.json"
)
read_json(labels_file)
[{'color': '#aec7e8', 'name': 'FieldStrength', 'shortcut_key': 'f'},
{'color': '#ffbb78', 'name': 'Diagnosis', 'shortcut_key': 'd'},
{'color': '#98df8a', 'name': 'N_Total', 'shortcut_key': 'n'},
{'color': '#ff9896', 'name': 'N_Total_Male'},
{'color': '#c5b0d5', 'name': 'N_Total_Female'},
{'color': '#dbdb8d', 'name': 'N_Patients'},
{'color': '#9edae5', 'name': 'N_Controls'},
{'color': '#aec7e8', 'name': 'N_Controls_Male'},
{'color': '#ffbb78', 'name': 'N_Controls_Female'},
{'color': '#98df8a', 'name': 'N_Patients_Male'},
{'color': '#ff9896', 'name': 'N_Patients_Female'},
{'color': '#c5b0d5', 'name': 'Age_Mean'},
{'color': '#c49c94', 'name': 'Age_Min'},
{'color': '#f7b6d2', 'name': 'Age_Max'},
{'color': '#dbdb8d', 'name': 'Scanner'},
{'color': '#9edae5', 'name': 'AnalysisTool'},
{'color': '#aec7e8', 'name': 'MRI_Modality'}]
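One common use of such a label list is building a lookup from label names to colors, e.g. when rendering annotations; a sketch using a toy subset of the labels above (note that shortcut_key is optional and absent from some entries):

```python
# Toy subset of the labels shown above.
labels = [
    {"color": "#aec7e8", "name": "FieldStrength", "shortcut_key": "f"},
    {"color": "#ffbb78", "name": "Diagnosis", "shortcut_key": "d"},
    {"color": "#c5b0d5", "name": "N_Total_Female"},
]

# Map each label name to its display color.
color_of = {label["name"]: label["color"] for label in labels}
print(color_of["Diagnosis"])
```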
Obtaining the full datasets#
This repository only contains the batches of documents currently being annotated.
These are typically part of a larger dataset, usually created with pubget.
It is possible to obtain the full dataset from which the annotated documents were drawn.
From the command line, this can be done with the download_datasets.py script:
python3 ./scripts/download_datasets.py [ PROJECT NAME ]
In Python, it can be done with the labelrepo.datasets module:
from labelrepo import datasets

# for example: project_name = "participant_demographics"
project_datasets = datasets.get_project_datasets(project_name)
The datasets are stored in analysis/data/datasets/.