How to use the data

Start by cloning the repository:

git clone https://github.com/neurodatascience/labelbuddy-annotations.git
cd labelbuddy-annotations

The documents, labels and annotations are stored in JSON or JSONLines files in the projects/ directory.
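
As a rough sketch of how these files can be located, the snippet below walks a projects/ directory and lists the annotation and label files of each project. It assumes the layout projects/<project>/annotations/*.jsonl and projects/<project>/labels/*.json, which is inferred from the example paths used later on this page. The function name list_project_files is hypothetical and not part of the repository.

```python
from pathlib import Path


def list_project_files(projects_dir):
    """Map each project name to its annotation and label files.

    Assumes the layout projects/<project>/{annotations,labels}/, an
    assumption based on the example paths shown on this page.
    """
    projects_dir = Path(projects_dir)
    result = {}
    for project in sorted(p for p in projects_dir.iterdir() if p.is_dir()):
        result[project.name] = {
            "annotations": sorted((project / "annotations").glob("*.jsonl")),
            "labels": sorted((project / "labels").glob("*.json")),
        }
    return result


if __name__ == "__main__":
    projects = Path("projects")
    if projects.is_dir():  # only true when run from the root of a clone
        for name, files in list_project_files(projects).items():
            print(name, len(files["annotations"]), len(files["labels"]))
```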

Building the database

We can create a SQLite database containing all the information in the repository (if you prefer not to use SQL, you can use a CSV file instead, as described further down):

make database

Or equivalently, e.g. if make is not available:

python3 ./scripts/make_database.py

We can then use this database to query the contents of the repository, either with the sqlite3 interactive command:

sqlite3 analysis/data/database.sqlite3

or using SQLite bindings that are available in many languages, including Python’s standard library module sqlite3.
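
As a minimal sketch of the Python route, the snippet below uses only the standard library sqlite3 module to list the tables and views of a database. The helper name list_tables_and_views is hypothetical, and the demo runs on an in-memory stand-in database with a made-up schema so that it works without a built repository.

```python
import sqlite3


def list_tables_and_views(connection):
    """Return (type, name) pairs for every table and view in the database."""
    rows = connection.execute(
        "SELECT type, name FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY type, name"
    )
    return rows.fetchall()


if __name__ == "__main__":
    # Stand-in in-memory database with a made-up schema so the sketch runs
    # anywhere; with a built repository, connect to
    # "analysis/data/database.sqlite3" instead.
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE label (id INTEGER PRIMARY KEY, name TEXT)")
    print(list_tables_and_views(connection))
    connection.close()
```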

A small Python package containing a few utilities for working with this repository is also provided in analysis/labelrepo. You can install it with

pip install -e "analysis/labelrepo"

(note the -e, or --editable, option).

The main tables are annotation, document and label. The view detailed_annotation gathers detailed information about each annotation, such as the selected_text, a snippet of surrounding text (context), the label_name and the annotator_name. For example, to display a few annotations:

from labelrepo import database, displays

connection = database.get_database_connection()

annotations = connection.execute("SELECT * FROM detailed_annotation LIMIT 5")
displays.AnnotationsDisplay(annotations)
NO N studies found
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title R…
NO N studies found
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title R…
DONE
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title Resti…
DONE
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title Resti…
no null results
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title Resting-s…

As another example, this time collecting the results in a pandas DataFrame, we can count how often each snippet of text has been annotated with “Diagnosis”:

import pandas as pd

pd.read_sql(
    """
    SELECT selected_text, COUNT(*) as occurrences
    FROM detailed_annotation
    WHERE label_name = 'Diagnosis'
    GROUP BY selected_text
    ORDER BY occurrences DESC
    """,
    connection,
)
                        selected_text  occurrences
0                       schizophrenia            1
1           autism spectrum disorders            1
2  Mild traumatic brain injury (mTBI)            1
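
When the label name comes from a variable, it is usually safer to pass it as a query parameter than to paste it into the SQL string. The sketch below shows this with sqlite3 `?` placeholders. The helper name count_selected_text is hypothetical, and the demo uses a toy in-memory table that mimics only two columns of detailed_annotation, filled with made-up rows.

```python
import sqlite3


def count_selected_text(connection, label_name):
    """Count how often each text snippet was annotated with `label_name`."""
    query = """
        SELECT selected_text, COUNT(*) AS occurrences
        FROM detailed_annotation
        WHERE label_name = ?
        GROUP BY selected_text
        ORDER BY occurrences DESC, selected_text
    """
    return connection.execute(query, (label_name,)).fetchall()


if __name__ == "__main__":
    # Toy stand-in for the real view, with made-up rows for illustration only.
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE detailed_annotation (label_name TEXT, selected_text TEXT)"
    )
    connection.executemany(
        "INSERT INTO detailed_annotation VALUES (?, ?)",
        [
            ("Diagnosis", "example A"),
            ("Diagnosis", "example A"),
            ("Scanner", "example B"),
        ],
    )
    print(count_selected_text(connection, "Diagnosis"))
    connection.close()
```

The same placeholder style should also work with pd.read_sql by passing params=(label_name,) alongside the query and the connection.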

Using a CSV rather than a database

If you prefer working with CSV files rather than SQL, you can also run (at the root of the repository)

make csv

That will create a file analysis/data/detailed_annotation.csv containing the detailed annotations table:

from labelrepo import repo

csv_file = repo.data_dir() / "detailed_annotation.csv"
annotations = pd.read_csv(csv_file, nrows=3)
displays.AnnotationsDisplay(annotations)
NO N studies found
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title R…
NO N studies found
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title R…
DONE
Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title Resti…

Here is all the information in that table for the first annotation:

try:
    display = (
        annotations.iloc[:1]
        .stack()
        .reset_index()
        .style.hide("level_0", axis=1)
        .hide(axis="index")
        .hide(axis="columns")
    )
except AttributeError:
    # old pandas version
    display = annotations.iloc[0]

display
pmcid 10014826
pmid 36937687
publication_year 2023
journal Front Neurosci
title Resting-state abnormalities in functional connectivity of the default mode network in migraine: A meta-analysis
label_name NO N studies found
label_color #ffbb78
annotator_name Kendra_Oudyk
start_char 0
end_char 2
extra_data 76.000000
project_name neuro-meta-analyses
selected_text Hu
context Hu, Su and Hao, Zeqi and Li, Mengting and Zhao, Mengqi and Wen, Jianjie and Gao, Yanyan and Wang, Qing and Xi, Hongyu and Antwi, Collins Opoku and Jia, Xize and Ren, Jun Front Neurosci, 2023 # Title R
context_start_char 0
context_end_char 202
doc_length 33400
doc_md5 8a1e4549746b2726299167899b3919ad
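
If pandas is not installed, the CSV file can also be read with the standard library csv module. The sketch below counts annotations per label; it assumes only that the file has a label_name column, as in the listing above. The helper name count_labels is hypothetical, and the demo reads a small made-up CSV from memory rather than the real file.

```python
import csv
import io
from collections import Counter


def count_labels(csv_lines):
    """Count annotations per label_name in an iterable of CSV lines."""
    reader = csv.DictReader(csv_lines)
    return Counter(row["label_name"] for row in reader)


if __name__ == "__main__":
    # Made-up stand-in; with the real file, pass instead
    # open("analysis/data/detailed_annotation.csv", newline="", encoding="utf-8")
    sample = io.StringIO(
        "label_name,selected_text\n"
        "DONE,Hu\n"
        "DONE,Hu\n"
    )
    print(count_labels(sample))
```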

Using the JSON and JSONLines files directly

.jsonl (JSONLines) files contain one JSON dictionary per line. They can be parsed with, for example, the json module from the Python standard library. The labelrepo package also contains a convenience function for parsing JSON or JSONLines files:

from labelrepo import read_json

annotations_file = (
    repo.repo_root()
    / "projects"
    / "participant_demographics"
    / "annotations"
    / "Jerome_Dockes.jsonl"
)

for row in read_json(annotations_file)[:3]:
    print(row)
{'annotations': [{'end_byte': 9, 'end_char': 8, 'label_name': 'discard', 'start_byte': 0, 'start_char': 0}], 'metadata': {'chapter': 1, 'doi': '10.3389/fnsys.2015.00126', 'field_positions': {'abstract': [289, 1183], 'authors': [0, 59], 'body': [1192, 33477], 'journal': [60, 79], 'keywords': [167, 276], 'publication_year': [81, 85], 'title': [96, 153]}, 'page': 1, 'part': 1, 'pmcid': 4565057, 'pmid': 26441558, 'text_md5': 'aeacc3bc705b025b4f7aecea35058ca0'}, 'utf8_text_md5_checksum': 'aeacc3bc705b025b4f7aecea35058ca0'}
{'annotations': [{'end_byte': 9458, 'end_char': 9450, 'label_name': 'healthy', 'start_byte': 9447, 'start_char': 9439}, {'end_byte': 9458, 'end_char': 9450, 'extra_data': '29', 'label_name': 'count', 'start_byte': 9447, 'start_char': 9439}], 'metadata': {'chapter': 1, 'doi': '10.3390/brainsci10090603', 'field_positions': {'abstract': [334, 1735], 'authors': [0, 69], 'body': [1744, 27005], 'journal': [70, 79], 'keywords': [206, 321], 'publication_year': [81, 85], 'title': [96, 192]}, 'page': 2, 'part': 1, 'pmcid': 7563756, 'pmid': 32887487, 'text_md5': '57be0e414d83cf88df7d14070e3ea9dd'}, 'utf8_text_md5_checksum': '57be0e414d83cf88df7d14070e3ea9dd'}
{'annotations': [{'end_byte': 7985, 'end_char': 7967, 'label_name': 'count', 'start_byte': 7983, 'start_char': 7965}, {'end_byte': 309, 'end_char': 308, 'label_name': 'diagnosis', 'start_byte': 280, 'start_char': 279}], 'metadata': {'chapter': 1, 'doi': '10.1016/j.nicl.2016.07.006', 'field_positions': {'abstract': [448, 2844], 'authors': [0, 159], 'body': [2853, 39646], 'journal': [160, 175], 'keywords': [322, 435], 'publication_year': [177, 181], 'title': [192, 308]}, 'page': 3, 'part': 1, 'pmcid': 5030332, 'pmid': 27672554, 'text_md5': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}, 'utf8_text_md5_checksum': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}
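
For comparison, here is a minimal sketch of parsing a .jsonl file with only the standard library and counting the label names it contains. It assumes each row has an 'annotations' list whose items carry a 'label_name', as in the output above. The helper names read_jsonl and count_annotation_labels are hypothetical and are not part of labelrepo.

```python
import json
from collections import Counter


def read_jsonl(path):
    """Parse a JSONLines file into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as stream:
        return [json.loads(line) for line in stream if line.strip()]


def count_annotation_labels(rows):
    """Count label names over the 'annotations' list of every document row."""
    return Counter(
        annotation["label_name"]
        for row in rows
        for annotation in row.get("annotations", [])
    )
```

For instance, count_annotation_labels(read_jsonl(annotations_file)) would summarize the labels used in the file loaded above.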

Loading labels from a JSON file:

labels_file = (
    repo.repo_root()
    / "projects"
    / "autism_mri"
    / "labels"
    / "Article_Terms.json"
)
read_json(labels_file)
[{'color': '#aec7e8', 'name': 'FieldStrength', 'shortcut_key': 'f'},
 {'color': '#ffbb78', 'name': 'Diagnosis', 'shortcut_key': 'd'},
 {'color': '#98df8a', 'name': 'N_Total', 'shortcut_key': 'n'},
 {'color': '#ff9896', 'name': 'N_Total_Male'},
 {'color': '#c5b0d5', 'name': 'N_Total_Female'},
 {'color': '#dbdb8d', 'name': 'N_Patients'},
 {'color': '#9edae5', 'name': 'N_Controls'},
 {'color': '#aec7e8', 'name': 'N_Controls_Male'},
 {'color': '#ffbb78', 'name': 'N_Controls_Female'},
 {'color': '#98df8a', 'name': 'N_Patients_Male'},
 {'color': '#ff9896', 'name': 'N_Patients_Female'},
 {'color': '#c5b0d5', 'name': 'Age_Mean'},
 {'color': '#c49c94', 'name': 'Age_Min'},
 {'color': '#f7b6d2', 'name': 'Age_Max'},
 {'color': '#dbdb8d', 'name': 'Scanner'},
 {'color': '#9edae5', 'name': 'AnalysisTool'},
 {'color': '#aec7e8', 'name': 'MRI_Modality'}]
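
A list of labels like this one is often easier to use as a lookup table. The sketch below builds a name-to-color mapping and a shortcut-to-name mapping, treating shortcut_key as optional since only some labels above define it. Both helper names are hypothetical, and the demo reuses two entries copied from the output above.

```python
def label_colors(labels):
    """Map each label name to its color."""
    return {label["name"]: label["color"] for label in labels}


def label_shortcuts(labels):
    """Map each shortcut key to its label name, skipping labels without one."""
    return {
        label["shortcut_key"]: label["name"]
        for label in labels
        if "shortcut_key" in label
    }


if __name__ == "__main__":
    # Two entries copied from the labels file shown above.
    labels = [
        {"color": "#aec7e8", "name": "FieldStrength", "shortcut_key": "f"},
        {"color": "#ff9896", "name": "N_Total_Male"},
    ]
    print(label_colors(labels))
    print(label_shortcuts(labels))
```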

Obtaining the full datasets

This repository only contains the batches of documents currently being annotated. These are typically part of a larger dataset, usually created with pubget. It is possible to obtain the full dataset from which the annotated documents were drawn. From the command line, this can be done with the download_datasets.py script:

python3 ./scripts/download_datasets.py [ PROJECT NAME ]

In Python, it can be done with the labelrepo.datasets module:

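from labelrepo import datasets

# name of one of the projects, for example "participant_demographics"
project_name = "participant_demographics"
project_datasets = datasets.get_project_datasets(project_name)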

The datasets are stored in analysis/data/datasets/.