Mining the biomedical scientific literature

We present a set of tools to easily collect and prepare biomedical publications for text-mining. Skip the tedious data collection and wrangling; focus on analyzing the text.

The biomedical literature comprises millions of articles and is expanding quickly. Due to this overwhelming size, using it efficiently often involves systematic or automated methods, collectively known as text-mining. A text-mining project involves several steps: finding relevant articles, downloading them, extracting their text, extracting information from the text, and finally running the main analysis, which yields the scientific insights. More often than not, the first steps of this process – data collection and curation – take a frustrating amount of time and effort, and are performed in a way that is difficult to reproduce or extend later.

Here, we describe useful tools and a simple workflow that make text-mining of the biomedical literature easier, more transparent and more fun. We hope to help you streamline the first tedious steps of your text-mining project and focus on the part you care about: extracting and analyzing high-quality information from text, rather than downloading, parsing and pre-processing thousands of articles.

Here is an overview of our suggested workflow, along with the tools we offer and possible places to store the output of each step.

1. Corpus collection & content extraction (tool: pubget; output stored on OSF)
   ↓
2. Manual annotation (tool: labelbuddy) and/or automatic information extraction (tools: pubget, pubextract, custom code); output stored on GitHub
   ↓
3. Analysis (custom code; output stored on GitHub)

Pubget

Pubget is a command-line program to obtain and process full-text articles from PubMed Central (PMC). For example, here is how we would obtain all the articles that mention “zika” in their abstract:

pubget run -q "zika[Abstract]" ./pubget_data

The pubget documentation provides a detailed description of all the options and outputs, and pubget --help produces a summary. After downloading all the articles, pubget extracts their content, including the text, abstract, metadata fields such as keywords and publication date, etc. This data is stored in CSV files that are easily loaded to analyze all the articles jointly.
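As a sketch of what working with this output might look like, here is how the extracted content could be loaded and filtered with Python's standard library. The sample data and column names (pmcid, title, abstract) are assumptions for illustration only; check the actual CSV files pubget produced in your output directory.

```python
import csv
import io

# Illustrative sample mimicking one of pubget's CSV output files; the
# real files and column names may differ between pubget versions, so
# inspect your own output directory first.
sample_csv = """pmcid,title,abstract
PMC123,Zika in mosquitoes,We study zika transmission in mosquito populations
PMC456,Arbovirus outbreak review,A review of recent arbovirus outbreaks
"""

# Load every article into a list of dicts, one dict per article.
articles = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep only articles whose abstract mentions "zika" (case-insensitive).
zika_articles = [a for a in articles if "zika" in a["abstract"].lower()]
```

With a real pubget output, `io.StringIO(sample_csv)` would simply be replaced by an open file handle on the CSV file.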

Pubget also has some features specific to meta-analysis of neuroimaging studies, such as the extraction of stereotactic coordinates.

Labelbuddy

In a text-mining project, we often need some manual annotations. For example, if we create a method for automatically extracting information from the text, we need labelled documents at least for validation (evaluating how well the automatic extraction performs). Labelbuddy is a simple, effective and flexible tool for performing this task.

Conveniently, pubget can generate JSON files containing the text it downloaded, simply by adding the --labelbuddy flag to the pubget command. Therefore, we can start annotating our articles very easily. For more details about labelbuddy, see its documentation.

The labelbuddy-annotations repository

If you annotate biomedical text with labelbuddy, we encourage you to share the annotations in this repository. This will facilitate re-use of annotations in different analyses, and collaboration across projects (e.g., different projects annotating the same articles with different types of information). As a bonus, you get a few utilities to work with the annotations, and you can showcase your project in the repository’s documentation.

Workflow

Data collection, labelling and cleaning tend to be messy. Still, we want the process to be as streamlined, transparent and reproducible as possible. Here is an overview of the organization we suggest; more technical details are provided in the labelbuddy-annotations repository’s documentation.

Define a query

Visit PubMed Central to help you choose the query (search terms) you will use to select which articles to download. You can follow the “Advanced” link to build a query with an advanced search dialogue. Once you are satisfied with the results returned, copy the query that appears in the search bar or in the “Search details” box on the right and write it in a text file, say query.txt.
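For example, saving the query could look like the following; the query text itself is purely illustrative, paste whatever PMC gave you instead.

```shell
# Save the query copied from the PMC search bar into a text file.
echo 'zika[Abstract] AND "open access"[Filter]' > query.txt
```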

Download & process articles with pubget

Now that you have your query, you can download all the corresponding articles with pubget using the --query_file option (or its short version, -f). You must also specify the directory in which pubget will store all its data; here we use pubget_data. Finally, we pass the --labelbuddy option to tell pubget to generate JSON files that we can later import into labelbuddy.

pubget run --labelbuddy -f ./query.txt ./pubget_data/

There are several other useful options, such as --n_jobs to run the content extraction in parallel, or --alias to create a symbolic link with a human-readable name pointing to the directory pubget will create for our query:

pubget run --labelbuddy --n_jobs 4 --alias "my_project" -f ./query.txt ./pubget_data/

Again, the pubget documentation describes all the options and outputs in detail.

Store the full pubget output online

Pubget will create a directory containing all the articles that match your query on PMC, the text and information it extracted, and the labelbuddy files. To make sure you keep and can share all the relevant data for your project, we recommend creating a compressed archive (.tar.gz) of this directory and uploading it to an online platform such as the OSF (Open Science Framework).
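A minimal sketch with standard shell tools, assuming the pubget_data directory used in the steps above:

```shell
# "pubget_data" is the output directory from the pubget run above.
mkdir -p pubget_data   # no-op if the directory already exists

# Create a compressed archive of the whole directory, ready to upload.
tar -czf pubget_data.tar.gz pubget_data/
```

The resulting pubget_data.tar.gz file can then be uploaded to OSF through its web interface or the osfclient command-line tool.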

Manually annotate some documents

Most likely, your project will involve some kind of information extraction – for example you may need to extract information about the study’s participants, drug or disease names, etc. A wide range of techniques and tools are at your disposal, and at this point you are getting to the interesting part of your project. But regardless of the approach you choose, you will need to verify that it produces satisfactory results on your corpus of text. The most common way of assessing this is to compare its output to “ground truth” labels from a human expert (probably yourself). Also, if the information you are looking for is difficult to extract automatically and your corpus is small, you might rely fully on manual labelling instead of using automated methods.

Fortunately, labelling the documents we just downloaded will be very easy. Pubget created a subdirectory called subset_allArticles_labelbuddyData, which contains files named documents_00001.jsonl, documents_00002.jsonl etc. Each of these files can be imported into labelbuddy and contains a batch of 200 articles. Inside that directory, create a labelbuddy database for your project and import a first batch of articles. This can be done from labelbuddy’s graphical interface or with the following command:

# create database and import documents
labelbuddy myproject.labelbuddy --import-docs ./documents_00001.jsonl
# launch labelbuddy
labelbuddy myproject.labelbuddy

Now you can create some labels and start annotating documents. You can refer to labelbuddy’s documentation (also available through the “Help” menu in labelbuddy).

Once you have annotated some documents, export the annotations to a JSON file (either from the “Import & Export” menu or with the labelbuddy export-docs command). This will allow you to track the annotations in Git, but also to easily read them with tools such as Python or R for your analyses.
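Because the export is plain JSON (here shown as JSONL, one document per line), reading it back for analysis takes only a few lines. The sketch below counts label frequencies in a small, purely illustrative sample; the exact field names (annotations, start_char, end_char, label_name) may differ depending on your labelbuddy version and export settings, so inspect your own export first.

```python
import json
from collections import Counter

# Two illustrative lines mimicking a labelbuddy JSONL export:
# one JSON object per annotated document.
exported = """\
{"metadata": {"pmcid": "PMC123"}, "annotations": [{"start_char": 10, "end_char": 18, "label_name": "disease"}]}
{"metadata": {"pmcid": "PMC456"}, "annotations": [{"start_char": 0, "end_char": 4, "label_name": "drug"}, {"start_char": 30, "end_char": 38, "label_name": "disease"}]}
"""

# Parse each line into a dict representing one annotated document.
docs = [json.loads(line) for line in exported.splitlines()]

# Count how often each label was used across all documents.
label_counts = Counter(
    anno["label_name"] for doc in docs for anno in doc["annotations"]
)
```

With a real export, the `exported` string would be replaced by reading the file labelbuddy produced.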

Consider adding a project to the labelbuddy-annotations repository

We encourage you to share any annotations you create in the labelbuddy-annotations repository. To do so, simply fork the repository and add a subdirectory under projects in which you store the labelbuddy file you are annotating, the labels you created in labelbuddy, and your annotations. The repository’s documentation provides all the details to get started.

Adding your annotations to this repository has several advantages. It makes your annotations visible online, easy to re-use and combine with other projects. The repository also contains some utilities to work with the annotations. For example, it can compile a CSV file that combines all the annotations from all projects, along with information about the articles. See the documentation on using the repository for details.

Use your text corpus and manual annotations for analysis

Now that you have your text corpus and your manual annotations in convenient formats, they are ready to load and analyze with your favorite scientific tools. The real work can begin!