Exploring the bioRxiv API with R, httr2, rvest, tidytext, and Datawrapper
Collect metadata and publication details for >200k preprints over a 10-year period, investigate trends, and scrape full text for sentiment analysis
Last year I wrote a post describing an R package I put together that fetches recent bioRxiv preprints from a given subject and summarizes them in a couple of sentences using a local LLM running through Ollama:
That tool has a limitation: it pulls recent paper titles and abstracts from the bioRxiv RSS feeds, which currently provide only the 30 most recent preprints in each subject area. Because of that, I recently started exploring bioRxiv’s open API at api.biorxiv.org. The API is simple, and the docs are easy to follow.
In the spirit of learning in public, I’ll demonstrate how to interact with this API¹ using httr2 and standard tidyverse tools to collect preprint metadata and peer-reviewed publication details for >200,000 preprints over a 10-year period. I also wanted to use this as an opportunity to try out Datawrapper for creating interactive visualizations, and the DatawRappr R package for sending data from R to Datawrapper. Next, I use rvest to scrape the full text of a selection of preprints by subject area and tidytext to perform sentiment analysis, and I conclude with a few ideas for other things you could do with this data.
Preprint metadata
I started by pulling details for every article in the 10-year period 2014-2023. The code to do this is here; details below. This captures the title, authors, corresponding author and institutional affiliation, abstract, DOI, and other information for every preprint posted in this period. After some lightweight cleaning and deduplication, I end up with 219,014 preprints. The data is available on Zenodo in CSV, RDS, and Parquet formats. Here’s an interactive plot of the number of preprints by subject area over that 10-year period, with a few called out. Details below.
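Before getting into the API itself, here’s roughly what the Datawrapper side of that workflow looks like: summarize in R, then push the table to a chart. A minimal sketch using DatawRappr, assuming you’ve stored an API token with datawrapper_auth() and that rdf (built below) has the category and date columns the details endpoint returns; the summary table, chart type, and title are illustrative:

library(tidyverse)
library(DatawRappr)

# Illustrative summary: preprint counts by subject area and year
counts <- rdf |>
  count(category, year = substr(date, 1, 4))

# Create an empty chart, send the data to it, and publish
mychart <- dw_create_chart(title = "bioRxiv preprints by subject area", type = "d3-lines")
dw_data_to_chart(counts, chart_id = mychart)
dw_publish_chart(mychart)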
The article details endpoint is simple, and the API docs are clear. Take a URL like this: https://api.biorxiv.org/details/biorxiv/2024-08-01/2024-08-31/201. You provide start and end dates, and you get back JSON for up to 100 articles that fall in that range. The last part of the URL is the “cursor” — since you can only get 100 results at a time, you increment this number to page through results starting wherever you want. In this case, I’m getting 100 results starting at result number 201. In the code I used to capture all the preprints from this period, I incremented the cursor by 100 in a loop until no new results came back.
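For illustration, here’s what a single request looks like with httr2 (the status field and collection list are what the paging loop below keys on):

library(httr2)

# One page of results: up to 100 records starting at cursor 201
resp <-
  request("https://api.biorxiv.org/details/biorxiv/2024-08-01/2024-08-31/201") |>
  req_perform() |>
  resp_body_json()

resp$messages[[1]]$status  # "ok" when the window contains results
length(resp$collection)    # up to 100 preprint records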
I’m also doing some simple data cleaning and deduplication. I tried deduplicating by keeping only the most recent version of each DOI, but that left a few cases where the same paper was uploaded multiple times, each upload assigned a different DOI. The numbers I end up with don’t match bioRxiv’s summary reports exactly, but they’re close.
Here’s a snippet of the code — see the GitHub gist for the full collection, cleaning, and summarization code.
library(tidyverse)
library(httr2)

# Set URL path variables
url <- "https://api.biorxiv.org/"
what <- "details"
server <- "biorxiv"
date1 <- "2014-01-01"
date2 <- "2023-12-31"

# Create base request URL
basereq <-
  request(url) |>
  req_url_path_append(what) |>
  req_url_path_append(server) |>
  req_url_path_append(date1) |>
  req_url_path_append(date2)

# Iterate through 100 publications at a time until there are no more.
# Break out of the loop if there are no more pubs at the current cursor.
cursor <- 0
responses <- list()
ok <- TRUE
while (ok) {
  req <-
    basereq |>
    req_url_path_append(cursor)
  resp <-
    req |>
    req_perform() |>
    resp_body_json()
  ok <- resp$messages[[1]]$status == "ok" && length(resp$collection) > 0
  if (!ok) break
  responses[[as.character(cursor)]] <- resp
  cursor <- cursor + 100L
}

# Turn that list of responses into a tibble
rdf <-
  responses |>
  map(\(x) x$collection |> enframe() |> unnest_wider(col = value)) |>
  bind_rows()

# ...further data cleaning and processing code...
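For illustration, the deduplication described above might look something like this minimal sketch (the real cleaning code is in the gist); it assumes rdf has the doi and version columns the details endpoint returns:

# Keep only the most recent version of each DOI, then drop exact title
# duplicates to catch the same paper uploaded under different DOIs
rdf_dedup <-
  rdf |>
  mutate(version = as.integer(version)) |>
  slice_max(version, n = 1, by = doi, with_ties = FALSE) |>
  distinct(title, .keep_all = TRUE)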
Published article details
bioRxiv attempts to update each preprint’s page with a link to the final peer-reviewed, published article. A similar endpoint lets you pull information about these published articles. For example, https://api.biorxiv.org/pubs/medrxiv/2024-08-01/2024-08-31 returns JSON about articles published in August 2024 (see the notes above about the “cursor” needed to capture more than 100 results). Here we get the DOI and posting date of the preprint, as well as the DOI, publication date, and journal name of the final peer-reviewed publication. Using code similar to the above, we can retrieve publication details for every preprint from the same 10-year period that has an associated publication. After cleaning and deduplication, I end up with 119,237 preprint/publication pairs. From here we can make a plot similar to the one above: how many preprints were peer reviewed and published in each subject area each year? The decline in publications after 2021 is interesting, given that the average lag between preprint and publication is less than a year. Perhaps there’s a lag between a paper’s publication in a peer-reviewed journal and the entry getting updated in bioRxiv’s database. Or perhaps more people are happy with a published preprint and don’t feel the need to keep pursuing “publication” in a peer-reviewed journal.
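Retrieving these records is the same paging exercise as before. A sketch that wraps the loop from above into a hypothetical fetch_all() helper and points it at the pubs endpoint:

library(tidyverse)
library(httr2)

# Hypothetical helper: page through any bioRxiv API request 100 records
# at a time, then flatten the results into a tibble (same logic as the
# details loop above)
fetch_all <- function(basereq) {
  cursor <- 0
  responses <- list()
  repeat {
    resp <-
      basereq |>
      req_url_path_append(cursor) |>
      req_perform() |>
      resp_body_json()
    if (resp$messages[[1]]$status != "ok" || length(resp$collection) == 0) break
    responses[[as.character(cursor)]] <- resp
    cursor <- cursor + 100L
  }
  responses |>
    map(\(x) x$collection |> enframe() |> unnest_wider(col = value)) |>
    bind_rows()
}

# Published-article details for the same 10-year period
pubs <-
  request("https://api.biorxiv.org/") |>
  req_url_path_append("pubs") |>
  req_url_path_append("biorxiv") |>
  req_url_path_append("2014-01-01") |>
  req_url_path_append("2023-12-31") |>
  fetch_all()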
With this information we could start to ask interesting questions, such as: how long is the average lag between preprint and publication date, and how does it vary by subject area? (Biochemistry has the shortest lag at 187 days, while Systems biology and Ecology are tied for the longest at 272 days.) Or we could look at the most common journals where preprints from each subject area end up getting published:
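Both of those questions reduce to simple grouped summaries over the pubs table. A sketch, assuming column names like preprint_category, preprint_date, published_date, and published_journal (check these against the field names the endpoint actually returns):

# Average preprint-to-publication lag by subject area
pubs |>
  mutate(lag_days = as.numeric(as.Date(published_date) - as.Date(preprint_date))) |>
  summarize(mean_lag_days = mean(lag_days, na.rm = TRUE), .by = preprint_category) |>
  arrange(mean_lag_days)

# Top 5 destination journals per subject area
pubs |>
  count(preprint_category, published_journal, sort = TRUE) |>
  slice_max(n, n = 5, by = preprint_category)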
Full text scraping and sentiment analysis
Here I’ll use rvest to scrape the full text of a preprint given its DOI², and demonstrate a simple sentiment analysis on the full text of articles from different subject areas. The code to do this is here as a GitHub gist; details below.
Of the topics shown here, cancer biology has the lowest sentiment, as you’d expect given the words that show up in almost any cancer paper’s introduction (disease, death, mortality, morbidity, tumor, metastasis, etc.). Scientific communication and education has the highest sentiment.
I’m first writing a function that uses rvest to scrape the full text of an article on bioRxiv given the DOI. Here’s the function, and an example of running this on a single DOI:
library(tidyverse)
library(rvest)
library(tidytext)

## Function to get full text given a DOI
get_full_text <- function(doi) {
  paste0("https://www.biorxiv.org/content/", doi, ".full") |>
    rvest::read_html() |>
    rvest::html_nodes("div.section") |>
    rvest::html_text() |>
    # Drop the References / Bibliography sections before collapsing
    grep("^references", x = _,
         invert = TRUE,
         value = TRUE,
         ignore.case = TRUE) |>
    grep("^bibliography", x = _,
         invert = TRUE,
         value = TRUE,
         ignore.case = TRUE) |>
    paste(collapse = "\n\n")
}

## Example: Get full text for a single DOI
doi <- "10.1101/2023.07.14.549004"
get_full_text(doi)
In the next section, I choose a few categories of interest and extract the full text of the 100 most recent articles in each by mapping my new function over their DOIs using purrr (see the full code here).
## Choose topics
mycats <- c("Bioinformatics",
            "Scientific Communication And Education",
            "Cancer Biology")

## Get full text from the 100 most recent papers in each category
fulltext_subset <-
  rdf |>
  filter(category %in% mycats) |>
  slice_tail(n = 100, by = category) |>
  mutate(fulltext = map_chr(.data$doi, get_full_text))
Finally, I’m using the tidytext package to perform sentiment analysis on the full text of articles in these subjects, as demonstrated here. This sentiment analysis is fairly rudimentary. The Bing lexicon I’m using categorizes words into a binary positive/negative sentiment. I’m simply counting the positive and negative sentiment words across all full texts in each subject, then calculating the sentiment score as sum(positive)-sum(negative), along with the percent positive. Here’s a snippet (see the full code here).
## Get sentiment analysis words
bingsentiment <- get_sentiments("bing")

## Conduct sentiment analysis
sentiment_results <-
  fulltext_subset |>
  unnest_tokens(word, fulltext) |>
  anti_join(stop_words, by = "word") |>
  inner_join(bingsentiment, by = "word") |>
  select(category, word, sentiment) |>
  count(category, sentiment) |>
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) |>
  mutate(total = positive + negative,
         sentiment = positive - negative,
         percent_positive = positive / total) |>
  arrange(percent_positive)
Other things you could do
These are just a few small examples of what you could do with all of this preprint metadata and the ability to retrieve full text from as many preprints as you wish. I can think of several other analyses that might be interesting, either with this data alone or by connecting it with other data such as citation counts, funding data, or the final published text.
Citation Analysis: This would require getting access to citation data, e.g. from CrossRef or other resources.
Citation Growth: Analyze the growth in citations for preprints over time, comparing this growth across different subject areas.
Impact of Early Citations: Investigate whether preprints that receive early citations are more likely to be peer-reviewed and published, and how this varies by field.
Collaboration Networks:
Authorship Networks: Map out the collaboration networks between authors, identifying key hubs or influential researchers within different subject areas.
Institutional Collaborations: Analyze which institutions collaborate the most, and in which subject areas. You could also explore the geographical spread of collaborations.
Text Analysis / NLP / LLMs:
Topic Modeling: Use topic modeling (e.g., LDA) to identify the most common topics in preprints over time. This could show how the focus of research changes across subject areas.
Abstract Complexity: Measure the complexity or readability of abstracts across different fields and how it correlates with publication success.
LLM training: Be careful here — you might need to check the license information on each of the preprints before including them in such a training corpus. Luckily, this license information is available in the data retrieved from the API.
Temporal Analysis:
Diff between preprint versions: You could extract the full text of successive versions of a preprint submission and see how they change: what was added, what was removed, what changed (see the sketch after this list).
Seasonal Trends: Explore whether there are seasonal trends in preprint submissions or peer-reviewed publications within different subject areas.
Diversity Analysis: Analyze the diversity of authorship by looking at the geographic or institutional distribution and how it changes over time or between fields.
Funding and Preprint Success:
Funding Source Analysis: If you could connect institutions and authors to funding information like what’s available from NIH reporter, explore the relationship between funding sources and the likelihood of a preprint being published in a peer-reviewed journal.
Grant Acknowledgements: Analyze whether mentioning certain funding agencies or grants in preprints correlates with faster publication or higher citation rates.
Comparison Between Preprints and Published Versions:
Content Evolution: Compare the content of the original preprints with their final published versions to analyze what changes are typically made during the peer-review process.
Word Count Analysis: Measure how the length of articles changes from preprint to published version and whether there are significant differences between fields.
Impact of COVID-19:
COVID-19 Influence: Analyze the impact of the COVID-19 pandemic on preprint submissions, peer-review timelines, and publication rates across different fields.
Preprint Citations During the Pandemic: Investigate whether preprints published during the pandemic received more citations than those published before or after.
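As a quick sketch of the version-diff idea above: bioRxiv serves specific versions of a preprint at URLs ending in v1, v2, and so on, so the get_full_text() function from earlier can fetch them directly. Whether this example DOI has a second version is an assumption here, and diffobj is just one option for a readable diff:

# Fetch a specific version by appending "v<N>" to the DOI; reuses
# get_full_text(), which builds the /content/<doi>.full URL
get_versioned_text <- function(doi, version) {
  get_full_text(paste0(doi, "v", version))
}

v1 <- get_versioned_text("10.1101/2023.07.14.549004", 1)
v2 <- get_versioned_text("10.1101/2023.07.14.549004", 2)

# Line-level diff of the two versions
diffobj::diffChr(strsplit(v1, "\n")[[1]],
                 strsplit(v2, "\n")[[1]])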
¹ There’s actually an R package for interacting with the bioRxiv API: rbiorxiv (https://github.com/nicholasmfraser/rbiorxiv). I didn’t look into it because (1) it was removed from CRAN in 2021, and (2) it hasn’t been updated in over three years. It may still work, but I wanted to learn and demonstrate interacting with the API using httr2 anyway.
² Technically, this section isn’t using the bioRxiv API. This page describes how to get bulk access to the full text of all bioRxiv articles for text and data mining, LLM training, or other use cases, but it’s in a requester-pays AWS bucket, and I don’t actually want everything. Here I’m simply scraping the text from the bioRxiv website using rvest. In theory you could use the code above to get the DOI of every preprint ever posted to bioRxiv, then use this scraping procedure to extract the full text of every article. However, be nice — you’ll almost certainly be temporarily banned or rate-limited by bioRxiv’s Cloudflare protection if you scrape all articles without any limits.
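One way to be nice: wrap the scraper so it pauses between requests and tolerates failures instead of erroring out mid-map. A sketch using purrr’s possibly(); the two-second delay is an arbitrary choice, not a bioRxiv-sanctioned rate:

library(purrr)

# Returns NA instead of erroring when a page fails to load, and sleeps
# between requests so we don't hammer the server
get_full_text_politely <- possibly(
  \(doi) {
    Sys.sleep(2)
    get_full_text(doi)
  },
  otherwise = NA_character_
)

# Drop-in replacement in the map_chr() call above:
# mutate(fulltext = map_chr(.data$doi, get_full_text_politely))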