Participant demographics
We have started annotating information about each study’s participants: their number, sex, age, and possible diagnosis. More annotations are needed, but the existing ones can already give an approximation of the number of participants in typical fMRI studies, and can be used to validate systems that aim to extract that information automatically. You can read more about the participant annotations on the corresponding project page.
The labelrepo package provides helpers to work with these annotations.
Here we illustrate loading the annotations to make a couple of simple plots.
The get_participant_demographics function returns a pd.DataFrame in which each row corresponds to a group of participants in a study, according to one annotator.
The data used here is also available directly as a CSV file.
from labelrepo.projects.participant_demographics import (
    get_participant_demographics,
)

subgroups = get_participant_demographics()
subgroups.iloc[0]
group_name        patients
subgroup_name     va
project_name      autism_mri
annotator_name    David_Kennedy
pmcid             9230060
start_char        []
end_char          []
diagnosis         criminals committing affective
count             6
male count        6.0
age mean          NaN
female count      NaN
age minimum       NaN
age maximum       NaN
age median        NaN
Name: 0, dtype: object
In this dataframe, each row contains the information collected by one annotator about a subgroup of participants in a study. You can read more about the nature of the annotations on the project page. The participants are divided into groups (“patients” or “healthy”). Each group is then divided into an arbitrary number of subgroups: usually there is only one, but there can be several, e.g. “adolescents” and “adults” within the “healthy” group. Finally, each subgroup is divided into “females” and “males”. The summary dataframe provides information at the level of the subgroup: its name, the group it belongs to, and the number of males and females in the subgroup.
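As a rough sanity check on the structure described above, the sketch below flags subgroups whose male and female counts are both known but do not add up to the subgroup count. It runs on a small made-up dataframe that only mimics the column names shown earlier. The assumption that count should equal the sum of the two sex counts is ours and may not hold for every annotation.

```python
import pandas as pd

# Made-up rows that only mimic the column names of the real dataframe.
toy = pd.DataFrame(
    {
        "subgroup_name": ["adolescents", "adults", "all"],
        "count": [10, 20, 15],
        "male count": [4.0, 12.0, 6.0],
        "female count": [6.0, 9.0, float("nan")],
    }
)

# Only compare rows where both sex counts were annotated.
both_known = toy["male count"].notna() & toy["female count"].notna()

# Assumption (ours): count should equal male count + female count.
mismatch = both_known & (
    toy["male count"] + toy["female count"] != toy["count"]
)
inconsistent = toy[mismatch]
print(inconsistent)
```

Rows with a missing sex count are left out of the comparison rather than treated as zero, since a missing value may simply mean the paper did not report it.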
If several annotators have annotated a paper, the information each of them has annotated about each participant subgroup is stored in a separate row. We don’t want to count participant groups several times, so here we simply keep the output of the first annotator for each study.
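# Keep one (pmcid, project_name, annotator_name) triple per paper:
# the first one that appears in the dataframe.
subset = ["pmcid", "project_name", "annotator_name"]
kept_annotators = set(
    subgroups.loc[:, subset]
    .drop_duplicates(subset=("pmcid",))
    .itertuples(index=False),
)
# Keep only the rows whose triple was retained above.
subgroups = subgroups.loc[
    [
        r in kept_annotators
        for r in subgroups.loc[:, subset].itertuples(index=False)
    ]
]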
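The same "first annotator per paper" selection can also be sketched with an inner merge instead of a set of tuples. This is only an illustrative alternative on made-up data (the annotator names and rows are hypothetical), not the way the package does it.

```python
import pandas as pd

# Made-up annotations: paper 1 was annotated by two people.
toy = pd.DataFrame(
    {
        "pmcid": [1, 1, 1, 2],
        "project_name": ["p", "p", "p", "p"],
        "annotator_name": ["alice", "alice", "bob", "bob"],
        "subgroup_name": ["adults", "children", "adults", "all"],
    }
)

subset = ["pmcid", "project_name", "annotator_name"]

# One row per paper: the first (project, annotator) pair encountered.
first_annotators = toy[subset].drop_duplicates(subset="pmcid")

# The inner merge keeps only the rows written by the retained annotator.
deduplicated = toy.merge(first_annotators, on=subset, how="inner")
print(deduplicated)
```

One difference from the boolean-mask version is that merge returns a fresh integer index, so the original row labels are not preserved.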
The labelrepo package also provides a way to load information about documents, labels and annotations in the repository in general.
Here we use it to read some complementary information about each paper.
We also add the PMC URL to turn scatter plot points into links later.
Finally, we turn the publication year into a datetime to facilitate using it in plots.
from labelrepo.database import get_database_connection
import pandas as pd

docs_info = pd.read_sql(
    "select pmcid, publication_year, title from document",
    get_database_connection(),
)
subgroups = subgroups.merge(docs_info, on="pmcid")
subgroups["pmc_url"] = [
    f"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{pmcid}/"
    for pmcid in subgroups["pmcid"].values
]
subgroups["publication_date"] = pd.to_datetime(
    subgroups["publication_year"], format="%Y"
)
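To see what the merge and date conversion do without access to the real database, here is a minimal sketch that uses an in-memory SQLite database. The table contents are made up, and using sqlite3 as a stand-in for get_database_connection is an assumption made for illustration only.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-in for the real database: an in-memory table
# with the same column names as in the query above.
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table document "
    "(pmcid integer, publication_year integer, title text)"
)
connection.executemany(
    "insert into document values (?, ?, ?)",
    [(1, 2015, "Toy paper A"), (2, 2020, "Toy paper B")],
)

docs_info = pd.read_sql(
    "select pmcid, publication_year, title from document", connection
)

# Made-up subgroups; paper 3 has no row in the document table.
toy_subgroups = pd.DataFrame(
    {"pmcid": [1, 1, 2, 3], "count": [10, 12, 30, 8]}
)

# merge defaults to an inner join, so paper 3 is dropped.
merged = toy_subgroups.merge(docs_info, on="pmcid")

# Converting the year to text first makes the "%Y" parsing explicit.
merged["publication_date"] = pd.to_datetime(
    merged["publication_year"].astype(str), format="%Y"
)
print(merged)
```

Because the default join is an inner join, any subgroup whose paper is missing from the document table silently disappears. Passing how="left" would keep it with missing publication information instead.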
Let’s start by looking at the total number of participants in each study.
import altair

total_counts = (
    subgroups.groupby(["pmcid", "pmc_url", "publication_date"])["count"]
    .sum()
    .reset_index()
)
altair.Chart(total_counts).mark_point(size=60).encode(
    x="publication_date:T",
    y=altair.Y("count", scale=altair.Scale(type="log")),
    tooltip=["pmcid"],
    href="pmc_url:N",
).interactive()
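One detail of the aggregation is worth a small sketch: by default, a grouped sum treats missing values as zero, and zero cannot be placed on a logarithmic axis. The example below uses made-up counts and shows one possible way to keep such totals missing. Whether the real annotations contain missing counts is not something we assume here.

```python
import pandas as pd

# Made-up counts; the only subgroup of paper 2 has no annotated count.
toy = pd.DataFrame(
    {
        "pmcid": [1, 1, 2, 3],
        "count": [10.0, 20.0, float("nan"), 5.0],
    }
)

# Default behaviour: the missing count contributes 0 to the total.
default_totals = toy.groupby("pmcid")["count"].sum()

# With min_count=1, a paper with no known count gets NaN instead of 0.
safe_totals = toy.groupby("pmcid")["count"].sum(min_count=1)

# Drop papers without any known count before plotting on a log scale.
plottable = safe_totals.dropna().reset_index()
print(plottable)
```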
As we do not have too many annotated papers yet, we can look at them individually and inspect the age means and ranges.
We start by selecting groups for which the mean age is known.
data = subgroups.dropna(subset=("age mean",)).copy()
The rest is not especially interesting; it only configures the plot.
data["subgroup_idx"] = data.groupby("pmcid").cumcount().values

ax = altair.Axis(ticks=False, grid=False, labels=False)
scale = altair.Scale(domain=[-1, 2])
tooltip = [
    "pmcid",
    "group_name",
    "subgroup_name",
    "female count",
    "male count",
    "diagnosis",
]
point = (
    altair.Chart(height=35)
    .mark_point()
    .encode(
        x="age mean",
        y=altair.Y("subgroup_idx", axis=ax, title=None, scale=scale),
        color="group_name",
        size="count",
        href="pmc_url:N",
        tooltip=tooltip,
    )
)
rule = (
    altair.Chart()
    .mark_rule()
    .encode(
        altair.X("age minimum"),
        altair.X2("age maximum"),
        y=altair.Y("subgroup_idx", axis=ax, title=None, scale=scale),
        color="group_name",
        href="pmc_url:N",
        tooltip=tooltip,
    )
)
altair.layer(point, rule, data=data).facet(
    row=altair.Row(
        "pmcid:N",
        header=altair.Header(labelAngle=0, labelAlign="left"),
        sort=altair.Sort(field="age mean", op="mean"),
    ),
).configure_facet(spacing=0).configure_view(stroke=None)
In the plot above, the position of each circle represents the mean age of the corresponding subgroup, the bar represents the age range (when it is known), and the size of the circle represents the number of participants in the subgroup.
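The subgroup_idx column used to stack subgroups vertically within each paper comes from a grouped cumulative count. A minimal sketch on made-up rows shows what that call produces:

```python
import pandas as pd

# Made-up subgroups for three papers.
toy = pd.DataFrame(
    {
        "pmcid": [10, 10, 20, 30, 30, 30],
        "subgroup_name": ["a", "b", "all", "x", "y", "z"],
    }
)

# cumcount numbers the rows of each paper 0, 1, 2, ... in order of
# appearance, giving every subgroup its own vertical slot.
toy["subgroup_idx"] = toy.groupby("pmcid").cumcount()
print(toy)
```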