Participant demographics


We have started annotating information about each study’s participants: their number, sex, age, and diagnosis, if any. More annotations are needed, but the existing ones can already give an approximate picture of how many participants typical fMRI studies include, and they can be used to validate systems that aim to extract this information automatically. You can read more about the participant annotations on the corresponding project page.

The labelrepo package provides helpers to work with these annotations. Here we illustrate loading the annotations to make a couple of simple plots. The get_participant_demographics function returns a pd.DataFrame in which each row corresponds to a group of participants in a study, according to one annotator. The data used here is also available directly as a CSV file.

from labelrepo.projects.participant_demographics import (
    get_participant_demographics,
)

subgroups = get_participant_demographics()
subgroups.iloc[0]

group_name                              patients
subgroup_name                                 va
project_name                          autism_mri
annotator_name                     David_Kennedy
pmcid                                    9230060
start_char                                    []
end_char                                      []
diagnosis         criminals committing affective
count                                          6
male count                                   6.0
age mean                                     NaN
female count                                 NaN
age minimum                                  NaN
age maximum                                  NaN
age median                                   NaN
Name: 0, dtype: object

In this dataframe, each row contains the information collected by one annotator about a subgroup of participants in a study. You can read more about the nature of the annotations on the project page. The participants are divided into groups (“patients” or “healthy”), then each group is divided into an arbitrary number of subgroups (usually there is only one, but there can be several, e.g. “adolescents” and “adults” within the “healthy” group), and finally each subgroup is divided into “females” and “males”. The summary dataframe provides information at the level of the subgroup: its name, the group it belongs to, and the number of males and females it contains.
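To make the hierarchy concrete, here is a minimal sketch on a hand-made table that mimics some of the columns shown above. All values are invented for illustration; only the column names are taken from the output above. It rolls subgroups up to the group level within each study.

```python
import pandas as pd

# Toy rows that mimic the summary dataframe; every value is made up.
toy = pd.DataFrame(
    {
        "pmcid": [1, 1, 1, 2],
        "group_name": ["healthy", "healthy", "patients", "healthy"],
        "subgroup_name": ["adolescents", "adults", "_", "_"],
        "count": [10, 12, 8, 20],
        "female count": [6.0, 5.0, None, 11.0],
        "male count": [4.0, 7.0, 8.0, 9.0],
    }
)

# Sum subgroup-level counts within each (study, group) pair.
# min_count=1 keeps a sum missing when no subgroup reports a value,
# instead of silently turning it into 0.
per_group = toy.groupby(["pmcid", "group_name"])[
    ["count", "female count", "male count"]
].sum(min_count=1)
print(per_group)
```

Using `min_count=1` is a design choice for this sketch: a missing female or male count stays missing after aggregation, which seems safer than reporting zero participants.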

If several annotators have annotated a paper, the information each of them has recorded about each participant subgroup is stored in a separate row. We don’t want to count participant groups several times, so here we simply keep the annotations of the first annotator listed for each study.

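# Columns that together identify one annotator's work on one paper.
subset = ["pmcid", "project_name", "annotator_name"]

# For each paper, keep the first (pmcid, project, annotator) triple found.
kept_annotators = set(
    subgroups.loc[:, subset]
    .drop_duplicates(subset=["pmcid"])
    .itertuples(index=False),
)
# Keep only the rows whose triple belongs to a retained annotator.
subgroups = subgroups.loc[
    [
        r in kept_annotators
        for r in subgroups.loc[:, subset].itertuples(index=False)
    ]
]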
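The effect of this filter can be checked on a small example. The sketch below uses invented rows and expresses the selection as an inner `merge`, which is an alternative formulation rather than the code used above; it should select the same rows, namely everything written by the first annotator listed for each paper.

```python
import pandas as pd

# Invented annotations: two annotators for paper 1, one for paper 2.
toy = pd.DataFrame(
    {
        "pmcid": [1, 1, 1, 2],
        "project_name": ["p", "p", "p", "p"],
        "annotator_name": ["alice", "alice", "bob", "bob"],
        "subgroup_name": ["adolescents", "adults", "adults", "_"],
    }
)
subset = ["pmcid", "project_name", "annotator_name"]

# First (pmcid, project, annotator) triple encountered for each paper.
first_annotator = toy[subset].drop_duplicates(subset=["pmcid"])

# An inner merge on the three key columns keeps only the rows written
# by that annotator; all of their subgroups are preserved.
deduplicated = toy.merge(first_annotator, on=subset, how="inner")
print(deduplicated)
```

Unlike boolean indexing, `merge` returns a fresh integer index, so this variant is only a drop-in replacement when the original index does not matter.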

The labelrepo package also provides a way to load general information about the documents, labels and annotations in the repository. Here we use it to read some complementary information about each paper. We also add the PMC URL so that scatter plot points can be turned into links later. Finally, we convert the publication year to a datetime to make it easier to use in plots.

from labelrepo.database import get_database_connection

import pandas as pd

docs_info = pd.read_sql(
    "select pmcid, publication_year, title from document",
    get_database_connection(),
)
subgroups = subgroups.merge(docs_info, on="pmcid")
subgroups["pmc_url"] = [
    f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/"
    for pmcid in subgroups["pmcid"].values
]
subgroups["publication_date"] = pd.to_datetime(
    subgroups["publication_year"], format="%Y"
)
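Note that `merge` performs an inner join by default, so subgroups whose paper is absent from the document table would be dropped without warning. Below is a sketch of how one might check for that, using invented data in place of the database query. The toy tables are assumptions standing in for the real `subgroups` and `docs_info`.

```python
import pandas as pd

# Invented stand-ins for the annotations and the document table.
toy_subgroups = pd.DataFrame(
    {"pmcid": [1, 1, 2, 3], "count": [10, 12, 20, 5]}
)
toy_docs = pd.DataFrame(
    {"pmcid": [1, 2], "publication_year": [2015, 2020], "title": ["A", "B"]}
)

# how="left" keeps every subgroup row, validate checks that each pmcid
# appears at most once in the document table, and indicator adds a
# "_merge" column recording where each row came from.
merged = toy_subgroups.merge(
    toy_docs,
    on="pmcid",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Papers that have annotations but no document information.
missing = merged.loc[merged["_merge"] == "left_only", "pmcid"].unique()
print("papers without document info:", list(missing))
```

If `missing` is empty, the inner join used above loses nothing and the two approaches give the same rows.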

Let’s start by looking at the total number of participants in each study.

import altair

total_counts = (
    subgroups.groupby(["pmcid", "pmc_url", "publication_date"])["count"]
    .sum()
    .reset_index()
)

altair.Chart(total_counts).mark_point(size=60).encode(
    x="publication_date:T",
    y=altair.Y("count", scale=altair.Scale(type="log")),
    tooltip=["pmcid"],
    href="pmc_url:N",
).interactive()
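The per-study totals can also be summarized numerically. Here is a minimal sketch with invented counts; in practice the `total_counts` dataframe computed above would take the place of the toy table.

```python
import pandas as pd

# Invented per-study totals, shaped like `total_counts`.
toy_totals = pd.DataFrame(
    {
        "pmcid": [1, 2, 3, 4],
        "publication_date": pd.to_datetime(
            ["2015", "2015", "2020", "2020"], format="%Y"
        ),
        "count": [12, 30, 24, 100],
    }
)

# Median number of participants across all studies.
overall_median = toy_totals["count"].median()

# Median number of participants per publication year.
median_by_year = toy_totals.groupby(
    toy_totals["publication_date"].dt.year
)["count"].median()

print(overall_median)
print(median_by_year)
```

The median is used here because sample sizes tend to be skewed by a few large studies, which is also why the plot uses a logarithmic y axis.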

Since only a small number of papers have been annotated so far, we can look at them individually and inspect the age means and ranges.

We start by selecting groups for which the mean age is known.

data = subgroups.dropna(subset=("age mean",)).copy()

The rest is not especially interesting; it only configures the plot.

data["subgroup_idx"] = data.groupby("pmcid").cumcount().values
ax = altair.Axis(ticks=False, grid=False, labels=False)
scale = altair.Scale(domain=[-1, 2])
tooltip = [
    "pmcid",
    "group_name",
    "subgroup_name",
    "female count",
    "male count",
    "diagnosis",
]
point = (
    altair.Chart(height=35)
    .mark_point()
    .encode(
        x="age mean",
        y=altair.Y("subgroup_idx", axis=ax, title=None, scale=scale),
        color="group_name",
        size="count",
        href="pmc_url:N",
        tooltip=tooltip,
    )
)
rule = (
    altair.Chart()
    .mark_rule()
    .encode(
        altair.X("age minimum"),
        altair.X2("age maximum"),
        y=altair.Y("subgroup_idx", axis=ax, title=None, scale=scale),
        color="group_name",
        href="pmc_url:N",
        tooltip=tooltip,
    )
)
altair.layer(point, rule, data=data).facet(
    row=altair.Row(
        "pmcid:N",
        header=altair.Header(labelAngle=0, labelAlign="left"),
        sort=altair.Sort(field="age mean", op="mean"),
    ),
).configure_facet(spacing=0).configure_view(stroke=None)

In the plot above:

The position of each circle represents the mean age of the corresponding subgroup.
The bar represents the age range (when it is known).
The size of the circle represents the number of participants in the subgroup.
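The vertical offset of each subgroup within a paper's row comes from the `subgroup_idx` column, which is built with `dropna` followed by `groupby(...).cumcount()`. The sketch below reproduces those two steps on invented values to show what they yield.

```python
import pandas as pd

# Invented subgroups: paper 1 has two, paper 2 has three.
toy = pd.DataFrame(
    {
        "pmcid": [1, 1, 2, 2, 2],
        "age mean": [25.0, None, 30.0, 41.5, 38.0],
        "age minimum": [18.0, 20.0, None, 35.0, 30.0],
    }
)

# Rows without a mean age are dropped; a missing minimum is tolerated,
# since only the "age mean" column is listed in `subset`.
data = toy.dropna(subset=["age mean"]).copy()

# cumcount numbers the remaining rows of each paper 0, 1, 2, ...
data["subgroup_idx"] = data.groupby("pmcid").cumcount()
print(data)
```

Because the index is assigned after dropping rows, the numbering within each paper has no gaps even when some subgroups lack a mean age.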