Turn a GitHub repo into a single text file for LLM-friendly input
The open source repo2txt web app takes a GitHub repo URL, displays the directory structure, lets you choose which files to include, and produces a single plain text file you can feed into any LLM.
If you use ChatGPT, Claude, or even some local model through Ollama or HuggingFace Assistants, you’ll know that the chat interface makes it challenging to feed in an entire repo like a Python or R package, because functions, tests, etc. can be scattered across many files throughout a repo. Here I’ll demonstrate how to turn an entire GitHub repo into a single text file for LLM-friendly input, with R and Python packages as examples.
GitHub Repo to Text Converter
This GitHub Repo to Text Converter can help here.
Web app: repo2txt.simplebasedomain.com
This little app takes a URL to a GitHub repo, lets you select which files in the repo directory structure to include, and creates a single plain text file you can feed into an LLM to ask for explanations, usage, or anything else you’d like to ask about the codebase. You can download or copy this text to the clipboard to later paste into an LLM of your choosing.
The tool runs entirely in the browser. You can also provide a personal access token if you want to access private repositories (again, securely, since everything runs in the browser).
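If you're curious what a browser-only tool like this might be doing behind the scenes, here's a minimal Python sketch of the kind of request involved. I haven't checked which endpoints repo2txt actually calls. This assumes GitHub's REST "get a tree" endpoint with recursive=1, a default branch named main, and two helper names (parse_repo_url, build_tree_request) that I made up for illustration. The personal access token is optional and only ever goes into the Authorization header.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse
from urllib.request import Request


def parse_repo_url(url: str) -> Tuple[str, str]:
    """Split a GitHub repo URL into (owner, repo). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        raise ValueError(f"not a GitHub URL: {url}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"expected github.com/<owner>/<repo>, got: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo


def build_tree_request(url: str, ref: str = "main",
                       token: Optional[str] = None) -> Request:
    """Build (but do not send) a request for the repo's full file tree.

    `ref` is a branch, tag, or SHA; "main" is only a guess at the default.
    """
    owner, repo = parse_repo_url(url)
    api_url = (
        f"https://api.github.com/repos/{owner}/{repo}"
        f"/git/trees/{ref}?recursive=1"
    )
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Only needed for private repos (or to raise the rate limit).
        headers["Authorization"] = f"Bearer {token}"
    return Request(api_url, headers=headers)
```

Passing the result to urllib.request.urlopen would return JSON whose tree field lists every path in the repo, which is all you need to draw the file picker.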
Demo: Python package with GPT-4o
I recently demonstrated writing a CLI app using Click from a Cookiecutter template in Python. It’s a simple app that tells you how much caffeine remains in your system based on how much you consume and when you’ll go to bed. It’s on GitHub at stephenturner/caffeinated.
Here I’m pasting the repo URL, https://github.com/stephenturner/caffeinated, into the app. The tool is smart enough to recognize that this is a Python package, and that I probably want the __init__.py, __main__.py, other package .py files, and the unit tests in tests/ that I’ll use with pytest.
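I don't know exactly how the app decides what to pre-select, but a simple extension filter gets you most of the way there. Here's a rough sketch of that idea. The preselect function, the default extension set, and the non-.py file names in the usage example are all my own assumptions, not taken from the tool or the repo.

```python
from pathlib import PurePosixPath

# Assumed set of "source-like" extensions; the real tool's list may differ.
DEFAULT_EXTENSIONS = {".py", ".r", ".js", ".html"}


def preselect(paths, extensions=DEFAULT_EXTENSIONS):
    """Return the paths whose extension is in `extensions` (case-insensitive)."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in extensions]


# Illustrative file listing; only the .py paths come from the preview in this post.
candidate_paths = [
    "caffeinated/__init__.py",
    "caffeinated/__main__.py",
    "caffeinated/cli.py",
    "tests/test_caffeinated.py",
    "README.md",
    "LICENSE",
]
selected = preselect(candidate_paths, {".py"})
```

Here selected keeps the four .py files and drops the README and LICENSE, which matches the selection I ended up with in the app.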
After hitting “Generate Text File” you’ll see that it generates a single plain text file, first listing out the directory structure of the selected files, then listing out the contents of each file separately. Here’s a preview. My __init__.py is empty (just to signal that this is a package), and the cli.py file is truncated here.
Directory Structure:
└── ./
    ├── caffeinated
    │   ├── __init__.py
    │   ├── __main__.py
    │   └── cli.py
    └── tests
        └── test_caffeinated.py
---
File: /caffeinated/__init__.py
---
---
File: /caffeinated/__main__.py
---
from .cli import caffeinated

if __name__ == "__main__":
    caffeinated()
---
File: /caffeinated/cli.py
---
import click
import math
from datetime import datetime
from importlib.metadata import version, PackageNotFoundError
### TRUNCATED ###
Once you have this, you can easily paste this into an LLM of your choice to chat with the codebase. Here, I pasted this into GPT-4o with the leading prompt: “Explain this code to me, how to use it what it does.” GPT-4o returned all of the text and code below as a single response.
First, the directory structure.
Next, an explanation of __main__.py, cli.py, and the tests.
Next, a demonstration of how to use it. Note above that I did not select the README to add to my text file. GPT-4o divined this usage information from the code itself, rather than regurgitating the README.
And finally, a simple explanation of the (minimal) test suite.
Demo: R package with Claude 3.5 Sonnet
This also works on an R package. In my last job I wrote a package to assist with containerizing R packages with a usethis-like interface. It’s called pracpac (practical R packaging [with Docker]), and you can read more in the paper or on the GitHub repo. Here I’ll take the code from the pracpac repo and pull out the relevant R source files.
This time I asked Claude 3.5 Sonnet to tell me more about this package.
It’s just HTML+JavaScript
If you look at the source code you’ll see that this is a simple implementation using just HTML and JavaScript. That means you can download the repo and open index.html locally, or, if you’d prefer a vanity URL, fork the repo and set up GitHub Pages to serve from your main branch (e.g. stephenturner.github.io/repo2txt).