A tool that automatically sorts PDF documents into folders using an LLM. Works great to clean up messy folders, or to use for filing new files from scanners. Has a nice TextUI or via flag can print to the console. Comes with a Docker container.
You specify the folder structure as a simple text document that looks like this:
- Financial : Financial documents and banking records
- Bank Accounts : Bank statements and account records
- By company
- Taxes : Tax returns and related documents
- By year
- Insurance : Insurance policies and claims
- By company
- Other : Finance documents that don't fit in above subcategories
If you are lazy, you can ask an LLM to write the structure for you. Right now the software supports local files, Google Drive and Dropbox. For models, it supports OpenLLM and Mistral. Drivers for these are modular and it's very easy to extend to other platforms.
- Install uv
- Create credentials if you want to use Dropbox or Google Drive
- Create
.envfile with your configuration - Create
layout.txtin your document store that describes the layout - Run:
uv run main.py --copy
| Variable | Description |
|---|---|
MISTRAL_API_KEY |
API key for Mistral AI |
OPENAI_API_KEY |
API key for OpenAI |
LLM_PROVIDER |
openai or mistral |
INBOX |
Source: gdrive:<folder-id>, local:<path>, or dropbox:<path> |
DOCSTORE |
Destination: gdrive:<folder-id>, local:<path>, or dropbox:<path> |
# Process inbox (preview only)
uv run main.py
# Process and copy to document store
uv run main.py --copy
# Verify files exist and re-copy missing
uv run main.py --copy --verify
# Re-analyze documents (ignore cache) and move if path changed
uv run main.py --update --copy
# Daemon mode: monitor inbox, copy, log, delete after success
uv run main.py --ingest
# Repair cache and handle duplicates
uv run main.py --repair
# Process single file
uv run main.py --file /path/to/document.pdf --copy| Option | Description |
|---|---|
--copy |
Copy files to document store |
--verify |
Verify destination exists (re-copy if missing) |
--update |
Re-analyze via LLM (ignore cache), move if path changed |
--log |
Log incoming files to --IncomingLog folder |
--inbox <uri> |
Inbox URI (overrides INBOX env var) |
--ingest |
Daemon mode: poll inbox, copy, log, delete source |
--repair |
Scan docstore, fix cache, handle duplicates |
--file <path> |
Process single PDF |
--showlayout |
Print folder hierarchy |
--deduplicate |
Merge duplicate company folders |
--auth-dropbox |
Authenticate with Dropbox (one-time setup) |
--cli |
Use CLI output instead of TUI |
- Go to Google Cloud Console
- Create a project (or select existing)
- Enable the Google Drive API
- Create a Service Account (IAM & Admin → Service Accounts)
- Create and download a JSON key for the service account
- Save as
service_account_key.jsonin the project root - Share your Drive folders with the service account email address
- Use the folder ID from the URL:
https://drive.google.com/drive/folders/<folder-id>
- Go to Dropbox App Console
- Create a new app with:
- Scoped access
- Full Dropbox access
- Permissions:
files.metadata.read,files.content.read,files.content.write
- Run the authentication flow:
uv run main.py --auth-dropbox
- Enter your App key and App secret when prompted
- Follow the browser link to authorize
- Credentials are saved to
dropbox_token.json
The document store needs a layout.txt defining your folder hierarchy:
Personal documents for the Smith family.
The Smiths are a family of 5. Parents are Bob and Mary, kids are Steve and Joe.
We have a side business renting out our house on Forest Lake via Airbnb via Lakehouse LLC.
Bob coaches for the local Little League.
---LAYOUT STARTS HERE---
- Financial : Banking and financial documents
- Bank Accounts : Statements and records
- By company
- By year
- Taxes : Tax returns
- By year
- Insurance : Policies and claims
- By company
- Medical : Health records
- By year
- By year: Auto-creates year subfolders (e.g.,
Taxes/2024/) - By company: Auto-creates company subfolders (e.g.,
Insurance/Geico/)
The --ingest mode is designed to monitor an inbox. If a new file is detected it moves it to the docstore and additionally save
them into a log. I use this with a ScanSnap scanner. Each scanned file automatically is filed correctly.
Specifically in this mode each file is:
- Written a log directory
--IngestLogin the docstore with the ingest date prepended (i.e. "2026-01-17 Tax Form 1099 for 2024 from...") - Filed correctly in the right folder in the docstore
- Deleted in the inbox
Dockerfile is included. Works great with --ingest mode.
--repair scans the docstore and:
- Fixes cache entries where
copiedflag was incorrectly reset - Detects duplicates (same file in multiple locations)
- Moves duplicates to
--Duplicatefolder (keeps the one matching suggested path) - Skips system folders starting with
--
Apache 2.0 - see LICENSE file.
