How to use the data#

Start by cloning the repository:

git clone https://github.com/neurodatascience/labelbuddy-annotations.git
cd labelbuddy-annotations

The documents, labels and annotations are stored in JSON or JSONLines files in the projects/ directory.
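Based on the paths used later in this document, a project's layout looks roughly like this (a sketch only; the exact files and subdirectories vary per project):

```
projects/
└── participant_demographics/
    ├── labels/                  # JSON files defining the labels
    ├── annotations/             # one JSONLines file per annotator
    │   └── Jerome_Dockes.jsonl
    └── ...                      # plus the documents being annotated
```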

Building the database#

We can create a SQLite database containing all the information in the repository (if you would rather not use SQL, a CSV file can be generated instead; see “Using a CSV rather than a database” below):

make database

Or equivalently (for example if make is not available):

python3 ./scripts/make_database.py

We can then use this database to query the contents of the repository, either with the sqlite3 interactive command:

sqlite3 analysis/data/database.sqlite3

or using SQLite bindings that are available in many languages, including Python’s standard library module sqlite3.
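As a minimal sketch of the standard-library route — here querying an in-memory database with a mock detailed_annotation table, since the real file is only produced by make database:

```python
import sqlite3

# In a real session, connect to the file built by `make database`:
#     connection = sqlite3.connect("analysis/data/database.sqlite3")
# An in-memory database with a mock table keeps this sketch self-contained.
connection = sqlite3.connect(":memory:")
connection.executescript(
    """
    CREATE TABLE detailed_annotation (label_name TEXT, selected_text TEXT);
    INSERT INTO detailed_annotation VALUES
        ('Diagnosis', 'schizophrenia'),
        ('Diagnosis', 'autism spectrum disorders'),
        ('entity', 'Rapid Residue Detection Assay Data');
    """
)
# Count annotations per label, most frequent first.
rows = connection.execute(
    "SELECT label_name, COUNT(*) AS n FROM detailed_annotation"
    " GROUP BY label_name ORDER BY n DESC"
).fetchall()
print(rows)  # [('Diagnosis', 2), ('entity', 1)]
```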

A small Python package containing a few utilities for working with this repository is also provided in analysis/labelrepo. You can install it with

pip install -e "analysis/labelrepo"

(note the -e, or --editable, option).

The main tables are annotation, document and label. detailed_annotation is a view gathering detailed information about each annotation, such as the selected_text, a snippet of surrounding text (context), the label_name and annotator_name, etc. For example, to display a few annotations:

from labelrepo import database, displays

connection = database.get_database_connection()

annotations = connection.execute("SELECT * FROM detailed_annotation LIMIT 5")
displays.AnnotationsDisplay(annotations)
entity
    …hian Dinani, Soudabeh and Millagaha Gedara, Nuwan Indika and Xu, Xuan and Richards, Emily and Maunsell, Fiona and Zad, Nader and Tell, Lisa A. Front Vet Sci, 2021 # Title Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians # Keywords MRL and tolerance commercial rapid assay test machine learning large scale data mining table extract…
NER method
    …extracted as BeatifulSoup object. Pandas read_html as a data analysis miner and manipulation and powerful web scraping tool for URL protocols was used to harvest data from HTML tables. ### Regular Expression Learning for Information Extraction To briefly explain the high-throughput regular-expression pattern matching method, we have implemented some similarity methods from Regular Expression techniques in our…
location
    …for drug names or other fields were observed while extracting data tables. ### Desired Information From Structured or Unstructured Documents Below we review multiple cases to extract data from tables. For these cases, it is required to check if the keywords determined important in the real-time data collection via PDF and webpage parsing are clearly characterized in the extracted tables provided …
NER method
    …he names of repeated columns should be consistent. Here we similarly attempted to authenticate the input string of each field using regular expression matching to cover more cases in our queries. Dictionary for synonyms or corresponding names considered for each field. Names are followed by some regular expressions to ensure correct field extractions . ### Extracting Semistructured Information From the We…
location
    …g beta-lactams, tetracyclines, aminoglycosides, and sulfonamides from the website ( ). Therefore, using our trained model based on the Python packages of requests and BeautifulSoup , all the rapid assay URL links for dairy tests are parsed and automatically examined for potentially available tables on each page. Below we presented an example of adaptable parsing of real-time data extracting. When the query pin…

As another example (this time collecting results in a Pandas DataFrame), selecting all snippets of text that have been annotated with “Diagnosis”:

import pandas as pd

pd.read_sql(
    """
    SELECT selected_text, COUNT(*) AS occurrences
    FROM detailed_annotation
    WHERE label_name = 'Diagnosis'
    GROUP BY selected_text
    ORDER BY occurrences DESC
    """,
    connection,
)
selected_text occurrences
0 schizophrenia 1
1 autism spectrum disorders 1
2 Mild traumatic brain injury (mTBI) 1

Using a CSV rather than a database#

If you prefer working with CSV files rather than SQL, you can run (at the root of the repository)

make csv

That will create a file analysis/data/detailed_annotation.csv containing the detailed annotations table:

from labelrepo import repo

csv_file = repo.data_dir() / "detailed_annotation.csv"
annotations = pd.read_csv(csv_file, nrows=3)
displays.AnnotationsDisplay(annotations)
entity
    …hian Dinani, Soudabeh and Millagaha Gedara, Nuwan Indika and Xu, Xuan and Richards, Emily and Maunsell, Fiona and Zad, Nader and Tell, Lisa A. Front Vet Sci, 2021 # Title Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians # Keywords MRL and tolerance commercial rapid assay test machine learning large scale data mining table extract…
NER method
    …extracted as BeatifulSoup object. Pandas read_html as a data analysis miner and manipulation and powerful web scraping tool for URL protocols was used to harvest data from HTML tables. ### Regular Expression Learning for Information Extraction To briefly explain the high-throughput regular-expression pattern matching method, we have implemented some similarity methods from Regular Expression techniques in our…
location
    …for drug names or other fields were observed while extracting data tables. ### Desired Information From Structured or Unstructured Documents Below we review multiple cases to extract data from tables. For these cases, it is required to check if the keywords determined important in the real-time data collection via PDF and webpage parsing are clearly characterized in the extracted tables provided …

Here is all the information in that table for the first annotation:

try:
    display = (
        annotations.iloc[:1]
        .stack()
        .reset_index()
        .style.hide("level_0", axis=1)
        .hide(axis="index")
        .hide(axis="columns")
    )
except AttributeError:
    # old pandas version
    display = annotations.iloc[0]

display
pmcid 8334182
pmid 34368270
publication_year 2021
journal Front Vet Sci
title Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians
label_name entity
label_color #729fcf
annotator_name Kendra_Oudyk
start_char 229
end_char 264
project_name NER_biomedical
selected_text Rapid Residue Detection Assay Data
context hian Dinani, Soudabeh and Millagaha Gedara, Nuwan Indika and Xu, Xuan and Richards, Emily and Maunsell, Fiona and Zad, Nader and Tell, Lisa A. Front Vet Sci, 2021 # Title Large-Scale Data Mining of Rapid Residue Detection Assay Data From HTML and PDF Documents: Improving Data Access and Visualization for Veterinarians # Keywords MRL and tolerance commercial rapid assay test machine learning large scale data mining table extract
context_start_char 29
context_end_char 464
doc_length 34250
doc_md5 813c6fa5fd4080082d6a66ba4a2fee44

Using the JSON and JSONLines files directly#

.jsonl (JSONLines) files contain one JSON dictionary per line. They can easily be parsed, for example with Python's standard library json module. The labelrepo package also provides a convenience function, read_json, for parsing both JSON and JSONLines files:

from labelrepo import read_json

annotations_file = (
    repo.repo_root()
    / "projects"
    / "participant_demographics"
    / "annotations"
    / "Jerome_Dockes.jsonl"
)

for row in read_json(annotations_file)[:3]:
    print(row)
{'annotations': [{'end_byte': 9, 'end_char': 8, 'label_name': 'discard', 'start_byte': 0, 'start_char': 0}], 'metadata': {'chapter': 1, 'doi': '10.3389/fnsys.2015.00126', 'field_positions': {'abstract': [289, 1183], 'authors': [0, 59], 'body': [1192, 33477], 'journal': [60, 79], 'keywords': [167, 276], 'publication_year': [81, 85], 'title': [96, 153]}, 'page': 1, 'part': 1, 'pmcid': 4565057, 'pmid': 26441558, 'text_md5': 'aeacc3bc705b025b4f7aecea35058ca0'}, 'utf8_text_md5_checksum': 'aeacc3bc705b025b4f7aecea35058ca0'}
{'annotations': [{'end_byte': 9458, 'end_char': 9450, 'label_name': 'healthy', 'start_byte': 9447, 'start_char': 9439}, {'end_byte': 9458, 'end_char': 9450, 'extra_data': '29', 'label_name': 'count', 'start_byte': 9447, 'start_char': 9439}], 'metadata': {'chapter': 1, 'doi': '10.3390/brainsci10090603', 'field_positions': {'abstract': [334, 1735], 'authors': [0, 69], 'body': [1744, 27005], 'journal': [70, 79], 'keywords': [206, 321], 'publication_year': [81, 85], 'title': [96, 192]}, 'page': 2, 'part': 1, 'pmcid': 7563756, 'pmid': 32887487, 'text_md5': '57be0e414d83cf88df7d14070e3ea9dd'}, 'utf8_text_md5_checksum': '57be0e414d83cf88df7d14070e3ea9dd'}
{'annotations': [{'end_byte': 7985, 'end_char': 7967, 'label_name': 'count', 'start_byte': 7983, 'start_char': 7965}, {'end_byte': 309, 'end_char': 308, 'label_name': 'diagnosis', 'start_byte': 280, 'start_char': 279}], 'metadata': {'chapter': 1, 'doi': '10.1016/j.nicl.2016.07.006', 'field_positions': {'abstract': [448, 2844], 'authors': [0, 159], 'body': [2853, 39646], 'journal': [160, 175], 'keywords': [322, 435], 'publication_year': [177, 181], 'title': [192, 308]}, 'page': 3, 'part': 1, 'pmcid': 5030332, 'pmid': 27672554, 'text_md5': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}, 'utf8_text_md5_checksum': '9eef6f93a376b2d0833d2fe6dc4fb6a3'}
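If you prefer to avoid the labelrepo helper, the same files can be read with the standard library json module, one json.loads call per line. A self-contained sketch (io.StringIO stands in for an open .jsonl file, with abridged rows shaped like the output above):

```python
import io
import json

# Each line of a JSONLines file is one complete JSON document.
jsonl_file = io.StringIO(
    '{"annotations": [], "metadata": {"pmcid": 4565057}}\n'
    '{"annotations": [], "metadata": {"pmcid": 7563756}}\n'
)
rows = [json.loads(line) for line in jsonl_file]
print([row["metadata"]["pmcid"] for row in rows])  # [4565057, 7563756]
```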

Loading labels from a JSON file:

labels_file = (
    repo.repo_root()
    / "projects"
    / "autism_mri"
    / "labels"
    / "Article_Terms.json"
)
read_json(labels_file)
[{'color': '#aec7e8', 'name': 'FieldStrength', 'shortcut_key': 'f'},
 {'color': '#ffbb78', 'name': 'Diagnosis', 'shortcut_key': 'd'},
 {'color': '#98df8a', 'name': 'N_Total', 'shortcut_key': 'n'},
 {'color': '#ff9896', 'name': 'N_Total_Male'},
 {'color': '#c5b0d5', 'name': 'N_Total_Female'},
 {'color': '#dbdb8d', 'name': 'N_Patients'},
 {'color': '#9edae5', 'name': 'N_Controls'},
 {'color': '#aec7e8', 'name': 'N_Controls_Male'},
 {'color': '#ffbb78', 'name': 'N_Controls_Female'},
 {'color': '#98df8a', 'name': 'N_Patients_Male'},
 {'color': '#ff9896', 'name': 'N_Patients_Female'},
 {'color': '#c5b0d5', 'name': 'Age_Mean'},
 {'color': '#c49c94', 'name': 'Age_Min'},
 {'color': '#f7b6d2', 'name': 'Age_Max'},
 {'color': '#dbdb8d', 'name': 'Scanner'},
 {'color': '#9edae5', 'name': 'AnalysisTool'},
 {'color': '#aec7e8', 'name': 'MRI_Modality'}]
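Each entry is a plain dictionary, so deriving e.g. a name-to-color lookup is a one-line comprehension (the two entries below are copied from the list above rather than loaded from the file):

```python
# Two label definitions, as loaded from a labels JSON file.
labels = [
    {"color": "#aec7e8", "name": "FieldStrength", "shortcut_key": "f"},
    {"color": "#ffbb78", "name": "Diagnosis", "shortcut_key": "d"},
]
# Map each label name to its display color.
color_of = {label["name"]: label["color"] for label in labels}
print(color_of["Diagnosis"])  # #ffbb78
```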

Obtaining the full datasets#

This repository contains only the batches of documents currently being annotated. These are typically part of a larger dataset, usually created with pubget. It is possible to obtain the full dataset from which the annotated documents were drawn. From the command line, this can be done with the download_datasets.py script:

python3 ./scripts/download_datasets.py [ PROJECT NAME ]

In Python, it can be done with the labelrepo.datasets module:

from labelrepo import datasets

# project_name is the name of a directory in projects/, e.g. "autism_mri"
project_datasets = datasets.get_project_datasets(project_name)

The datasets are stored in analysis/data/datasets/.