GitHub Repositories for Sequencing Companies
Using R and the GitHub API to get licensing information for the GitHub repos of sequencing-related companies, one-shot with GPT-5.
A few days ago I saw this Tweet from Jeremy Leipzig counting the number of GitHub repositories that sequencing-related companies have on their organization pages. I was immediately curious how many of these had an open-source license versus a restrictive or unknown license.
I tried to one-shot this with GPT-5, given a screenshot of Jeremy’s Tweet and a few instructions. It got me 90% of the way, but I had to make a few tweaks here and there. The code uses the gh R package to access the GitHub API, pulls all the repos from each organization, and extracts each repo’s license. It then classifies the license as open / permissive if it’s something like MIT, BSD, Apache, various flavors of GPL, or certain CC licenses, and as restrictive / unknown if it’s anything else or unlisted. Note that “insightsengineering” is Roche/Genentech.
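To make the fetch-and-classify step concrete, here is a minimal sketch of how it could look. This is not the actual code from the post: the function names (`fetch_org_licenses`, `classify_license`), the `open_patterns` allow-list, and the choice of which CC licenses count as open are all my own assumptions for illustration. It relies on the `spdx_id` field that the GitHub API returns in each repo's `license` object.

```r
# Illustrative sketch only; names and the allow-list are assumptions,
# not the code used to produce the data in this post.

# Assumed allow-list of SPDX identifier patterns treated as open / permissive.
# Which CC licenses qualify is a guess (CC0 and CC-BY-4.0 here).
open_patterns <- c("^MIT", "^BSD", "^Apache", "^GPL", "^LGPL", "^AGPL",
                   "^CC0-", "^CC-BY-4\\.0$")

# Classify a single SPDX identifier (NULL, NA, "" and NOASSERTION are unknown).
classify_license <- function(spdx_id) {
  if (is.null(spdx_id) || is.na(spdx_id) || spdx_id %in% c("", "NOASSERTION")) {
    return("restrictive / unknown")
  }
  is_open <- grepl(paste(open_patterns, collapse = "|"), spdx_id,
                   ignore.case = TRUE)
  if (is_open) "open / permissive" else "restrictive / unknown"
}

# Pull every repo for one organization and record its license identifier.
# Requires the gh package and, for reasonable rate limits, a GitHub token.
fetch_org_licenses <- function(org) {
  repos <- gh::gh("GET /orgs/{org}/repos", org = org, .limit = Inf)
  licenses <- vapply(repos, function(r) {
    id <- r$license$spdx_id          # NULL when the repo has no detected license
    if (is.null(id)) NA_character_ else id
  }, character(1))
  data.frame(
    org     = rep(org, length(repos)),
    repo    = vapply(repos, function(r) r$name, character(1)),
    license = licenses,
    class   = vapply(licenses, classify_license, character(1),
                     USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Example usage (makes network calls, so it is left commented out):
# orgs <- c("insightsengineering")
# all_repos <- do.call(rbind, lapply(orgs, fetch_org_licenses))
# table(all_repos$org, all_repos$class)
```

Matching on SPDX identifiers rather than license names keeps the rules short, but anything GitHub cannot detect comes back as `NOASSERTION` or missing and lands in the restrictive / unknown bucket, which is why that category is likely an overestimate of truly restrictive repos.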
Here’s the full data retrieved using this code.
The code will give you a ggplot2 plot, but I’m embedding everything here using Datawrapper, which also has an R package to create visualizations through the API.