How to use the data#
Start by cloning the repository:
git clone https://github.com/neurodatascience/labelbuddy-annotations.git
cd labelbuddy-annotations
The documents, labels and annotations are stored in JSON or JSONLines files in the projects/ directory.
Building the database#
We can easily create a SQLite database containing all the information in the repository (if you don't like SQL, you can use a CSV file instead, as described below):
make database
Or equivalently, for example if make is not available:
python3 ./scripts/make_database.py
We can then use this database to query the contents of the repository, either with the sqlite3 interactive command:
sqlite3 analysis/data/database.sqlite3
or using SQLite bindings that are available in many languages, including Python’s standard library module sqlite3.
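For instance, using only Python's standard library, we can inspect which tables and views the database contains. This is a minimal sketch: `list_tables` is a hypothetical helper, and the database path assumes `make database` has been run as above.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables and views in a SQLite database."""
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type IN ('table', 'view')"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]


# After running `make database`:
# print(list_tables("analysis/data/database.sqlite3"))
```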
A small Python package containing a few utilities for working with this repository is also provided in analysis/labelrepo. You can install it with
pip install -e "analysis/labelrepo"
(note the -e, or --editable, option).
The main tables are annotation, document and label. detailed_annotation is a view gathering detailed information about each annotation, such as the selected_text, a snippet of surrounding text (context), the label_name and annotator_name, etc.
For example, to display a few annotations:
from labelrepo import database, displays
connection = database.get_database_connection()
annotations = connection.execute("SELECT * FROM detailed_annotation LIMIT 5")
displays.AnnotationsDisplay(annotations)
As another example (this time collecting results in a Pandas DataFrame), selecting all snippets of text that have been annotated with “Diagnosis”:
import pandas as pd
pd.read_sql(
"""
SELECT selected_text, COUNT(*) as occurrences
FROM detailed_annotation
WHERE label_name = 'Diagnosis'
GROUP BY selected_text
ORDER BY occurrences DESC
""",
connection,
)
| | selected_text | occurrences |
|---|---|---|
| 0 | schizophrenia | 1 |
| 1 | autism spectrum disorders | 1 |
| 2 | Mild traumatic brain injury (mTBI) | 1 |
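The same query can also be run without pandas, through the standard-library sqlite3 module. This is a minimal sketch: `diagnosis_snippets` is a hypothetical helper, and the path assumes the database built above.

```python
import sqlite3


def diagnosis_snippets(db_path):
    """Count occurrences of each snippet annotated with 'Diagnosis'."""
    query = """
        SELECT selected_text, COUNT(*) AS occurrences
        FROM detailed_annotation
        WHERE label_name = 'Diagnosis'
        GROUP BY selected_text
        ORDER BY occurrences DESC
    """
    connection = sqlite3.connect(db_path)
    try:
        return connection.execute(query).fetchall()
    finally:
        connection.close()


# After running `make database`:
# print(diagnosis_snippets("analysis/data/database.sqlite3"))
```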
Using a CSV rather than a database#
If you prefer working with CSV files rather than SQL, you can also run (at the root of the repository):
make csv
That will create a file analysis/data/detailed_annotation.csv containing the detailed annotations table:
from labelrepo import displays, repo
csv_file = repo.data_dir() / "detailed_annotation.csv"
annotations = pd.read_csv(csv_file, nrows=3)
displays.AnnotationsDisplay(annotations)
Here is all the information in that table for the first annotation:
try:
display = (
annotations.iloc[:1]
.stack()
.reset_index()
.style.hide("level_0", axis=1)
.hide(axis="index")
.hide(axis="columns")
)
except AttributeError:
# old pandas version
display = annotations.iloc[0]
display
pmcid | 10014826
pmid | 36937687
publication_year | 2023
journal | Front Neurosci
title | Resting-state abnormalities in functional connectivity of the default mode network in migraine: A meta-analysis
label_name | NO N studies found
label_color | #ffbb78
annotator_name | Kendra_Oudyk
start_char | 0
end_char | 2
extra_data | 76.000000
project_name | neuro-meta-analyses
selected_text | Hu
context | Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title R
context_start_char | 0
context_end_char | 202
doc_length | 33400
doc_md5 | 8a1e4549746b2726299167899b3919ad
Using the JSON and JSONLines files directly#
.jsonl (JSONLines) files contain one JSON dictionary per line. They can easily be parsed, for example with the json module from the Python standard library. Moreover, the labelrepo package provides a convenience function for parsing JSON or JSONLines files:
from labelrepo import read_json, repo
annotations_file = (
repo.repo_root()
/ "projects"
/ "participant_demographics"
/ "annotations"
/ "Jerome_Dockes.jsonl"
)
for row in read_json(annotations_file)[:3]:
print(row)
{'annotations': [{'end_byte': 9, 'end_char': 8, 'label_name': 'discard', 'start_byte': 0, 'start_char': 0}], 'metadata': {'chapter': 1, 'doi': '10.3389/fnsys.2015.00126', 'field_positions': {'abstract': [289, 1183], 'authors': [0, 59], 'body': [1192, 33477], 'journal': [60, 79], 'keywords': [167, 276], 'publication_year': [81, 85], 'title': [96, 153]}, 'page': 1, 'part': 1, 'pmcid': 4565057, 'pmid': 26441558, 'text_md5': 'aeacc3bc705b025b4f7aecea35058ca0'}, 'utf8_text_md5_checksum': 'aeacc3bc705b025b4f7aecea35058ca0'}
{'annotations': [{'end_byte': 9458, 'end_char': 9450, 'label_name': 'healthy', 'start_byte': 9447, 'start_char': 9439}, {'end_byte': 9458, 'end_char': 9450, 'extra_data': '29', 'label_name': 'count', 'start_byte': 9447, 'start_char': 9439}], 'metadata': {'chapter': 1, 'doi': '10.3390/brainsci10090603', 'field_positions': {'abstract': [334, 1735], 'authors': [0, 69], 'body': [1744, 27005], 'journal': [70, 79], 'keywords': [206, 321], 'publication_year': [81, 85], 'title': [96, 192]}, 'page': 2, 'part': 1, 'pmcid': 7563756, 'pmid': 32887487, 'text_md5': '57be0e414d83cf88df7d14070e3ea9dd'}, 'utf8_text_md5_checksum': '57be0e414d83cf88df7d14070e3ea9dd'}
{'annotations': [{'end_byte': 7985, 'end_char': 7967, 'label_name': 'count', 'start_byte': 7983, 'start_char': 7965}, {'end_byte': 309, 'end_char': 308, 'label_name': 'diagnosis', 'start_byte': 280, 'start_char': 279}], 'metadata': {'chapter': 1, 'doi': '10.1016/j.nicl.2016.07.006', 'field_positions': {'abstract': [448, 2844], 'authors': [0, 159], 'body': [2853, 39646], 'journal': [160, 175], 'keywords': [322, 435], 'publication_year': [177, 181], 'title': [192, 308]}, 'page': 3, 'part': 1, 'pmcid': 5030332, 'pmid': 27672554, 'text_md5': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}, 'utf8_text_md5_checksum': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}
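As noted above, JSONLines files can also be parsed without labelrepo, using only the standard library. A minimal sketch (`read_jsonl` is a hypothetical helper):

```python
import json


def read_jsonl(path):
    """Parse a JSONLines file: one JSON dictionary per line."""
    with open(path, encoding="utf-8") as stream:
        return [json.loads(line) for line in stream if line.strip()]
```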
Loading labels from a JSON file:
labels_file = (
repo.repo_root()
/ "projects"
/ "autism_mri"
/ "labels"
/ "Article_Terms.json"
)
read_json(labels_file)
[{'color': '#aec7e8', 'name': 'FieldStrength', 'shortcut_key': 'f'},
{'color': '#ffbb78', 'name': 'Diagnosis', 'shortcut_key': 'd'},
{'color': '#98df8a', 'name': 'N_Total', 'shortcut_key': 'n'},
{'color': '#ff9896', 'name': 'N_Total_Male'},
{'color': '#c5b0d5', 'name': 'N_Total_Female'},
{'color': '#dbdb8d', 'name': 'N_Patients'},
{'color': '#9edae5', 'name': 'N_Controls'},
{'color': '#aec7e8', 'name': 'N_Controls_Male'},
{'color': '#ffbb78', 'name': 'N_Controls_Female'},
{'color': '#98df8a', 'name': 'N_Patients_Male'},
{'color': '#ff9896', 'name': 'N_Patients_Female'},
{'color': '#c5b0d5', 'name': 'Age_Mean'},
{'color': '#c49c94', 'name': 'Age_Min'},
{'color': '#f7b6d2', 'name': 'Age_Max'},
{'color': '#dbdb8d', 'name': 'Scanner'},
{'color': '#9edae5', 'name': 'AnalysisTool'},
{'color': '#aec7e8', 'name': 'MRI_Modality'}]
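Since the parsed labels are plain dictionaries, deriving lookup tables from them is straightforward. For example, mapping label names to their display colors (using the first two entries shown above):

```python
# Labels as parsed from the JSON file above (first two entries shown)
labels = [
    {"color": "#aec7e8", "name": "FieldStrength", "shortcut_key": "f"},
    {"color": "#ffbb78", "name": "Diagnosis", "shortcut_key": "d"},
]

# Build a lookup from label name to display color
color_of = {label["name"]: label["color"] for label in labels}
print(color_of["Diagnosis"])  # → #ffbb78
```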
Obtaining the full datasets#
This repository only contains the batches of documents currently being annotated.
These are typically part of a larger dataset, usually created with pubget.
It is possible to obtain the full dataset from which the annotated documents were drawn.
From the command line, this can be done with the download_datasets.py script:
python3 ./scripts/download_datasets.py [ PROJECT NAME ]
In Python, it can be done with the labelrepo.datasets module:
from labelrepo import datasets

# project_name is a directory under projects/, e.g. "autism_mri"
project_datasets = datasets.get_project_datasets(project_name)
The datasets are stored in analysis/data/datasets/.