0 parents
commit 4735a89
Showing 96 changed files with 8,235 additions and 0 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: cdf485b9f897c3c784b3534aab3b7c2a
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file.
@@ -0,0 +1,14 @@
API Reference
=============

.. toctree::
   :maxdepth: 2

   downloader
   parser
   filing_viewer
   mulebot

This section provides detailed API documentation for datamule's modules.

Note: This documentation is automatically generated from the source code.
@@ -0,0 +1,72 @@
Datasets
========

Available Datasets
------------------

datamule provides access to several SEC datasets:

- **FTD Data** (fails-to-deliver; 1.3GB, ~60s to download)

  * Every FTD since 2004
  * ``dataset='ftd'``

- **10-Q Filings**

  * Every 10-Q since 2001
  * 500MB-3GB per year, ~5 minutes to download
  * ``dataset='10q_2023'`` (replace year as needed)

- **10-K Filings**

  * Every 10-K from 2001 to September 2024
  * ``dataset='10k_2002'`` (replace year as needed)

- **13F-HR Information Tables**

  * Every 13F-HR Information Table since 2013
  * Updated to current date
  * ``dataset='13f_information_table'``

- **MD&A Collection**

  * 100,000 MD&As since 2001
  * Requires free API key (beta)

Usage Example
-------------

.. code-block:: python

    import datamule as dm

    downloader = dm.Downloader()

    downloader.download_dataset(dataset='ftd')
    downloader.download_dataset(dataset='10q_2023')
    downloader.download_dataset(dataset='13f_information_table')

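The per-year dataset keys follow a visible pattern (``10q_2023``, ``10k_2002``), so building them programmatically can avoid typos when fetching several years. The sketch below is not part of datamule: ``yearly_dataset_name`` and ``yearly_dataset_names`` are hypothetical helpers, and the 2001 lower bound is taken from the list above.

```python
# Hypothetical helpers (not part of datamule) that build per-year dataset keys
# in the '10q_2023' / '10k_2002' pattern shown in the dataset list.

FORM_PREFIXES = {"10-Q": "10q", "10-K": "10k"}
FIRST_YEAR = 2001  # both collections are described as starting in 2001


def yearly_dataset_name(form: str, year: int) -> str:
    """Return the dataset key for one form type and year, e.g. '10q_2023'."""
    if form not in FORM_PREFIXES:
        raise ValueError(f"no per-year dataset known for form {form!r}")
    if year < FIRST_YEAR:
        raise ValueError(f"datasets start in {FIRST_YEAR}, got {year}")
    return f"{FORM_PREFIXES[form]}_{year}"


def yearly_dataset_names(form: str, first_year: int, last_year: int) -> list:
    """Return the keys for every year in the inclusive range."""
    return [yearly_dataset_name(form, year) for year in range(first_year, last_year + 1)]


if __name__ == "__main__":
    for name in yearly_dataset_names("10-Q", 2021, 2023):
        print(name)
```

Each resulting key could then be passed as the ``dataset`` argument of ``download_dataset()``.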

Notes
-----

* Bulk datasets may become out of date. Use ``download_dataset()`` together with ``download()`` to fill gaps.
* The ``13f_information_table`` dataset fills gaps automatically.

Package Data
------------

The package includes several useful CSV datasets:

- ``company_former_names.csv``: Former names of companies
- ``company_metadata.csv``: Metadata including SIC classification
- ``company_tickers.csv``: CIK, ticker, name mappings
- ``sec-glossary.csv``: Form types and descriptions
- ``xbrl_descriptions.csv``: Category fact descriptions

Usage example:

.. code-block:: python

    import pandas as pd
    from datamule import load_package_dataset

    company_tickers = pd.DataFrame(load_package_dataset('company_tickers'))

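Once the ticker mapping is loaded, a common next step is looking up a CIK by ticker. The sketch below works on plain dictionaries so it runs without datamule or pandas. The column names ``cik`` and ``ticker`` are assumptions about the CSV layout, and the two sample rows are illustrative stand-ins for the real file.

```python
# A minimal sketch of a ticker -> CIK lookup over rows shaped like
# company_tickers.csv. The 'cik' and 'ticker' keys are assumed column names.

def build_ticker_index(rows):
    """Map upper-cased ticker symbols to integer CIKs."""
    index = {}
    for row in rows:
        index[row["ticker"].strip().upper()] = int(row["cik"])
    return index


def cik_for_ticker(index, ticker):
    """Return the CIK for a ticker, or None when the ticker is unknown."""
    return index.get(ticker.strip().upper())


# Illustrative rows only; the real data would come from load_package_dataset.
SAMPLE_ROWS = [
    {"cik": "1318605", "ticker": "TSLA"},
    {"cik": "320193", "ticker": "AAPL"},
]

if __name__ == "__main__":
    index = build_ticker_index(SAMPLE_ROWS)
    print(cik_for_ticker(index, "tsla"))
```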

Updating Package Data
---------------------

You can update the package data using:

.. code-block:: python

    downloader.update_company_tickers()
    downloader.update_metadata()
@@ -0,0 +1,48 @@
Examples
========

Basic Downloads
---------------

Download 10-K filings for specific companies:

.. code-block:: python

    import datamule as dm

    downloader = dm.Downloader()

    # Download by CIK
    downloader.download(form='10-K', cik='1318605')

    # Download by ticker
    downloader.download(form='10-K', ticker=['TSLA', 'META'])

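CIKs show up in several spellings: bare integers, strings, and the 10-digit zero-padded form that SEC data URLs use. The helper below is a hypothetical convenience, not part of datamule, that normalizes any of these to the padded form. Whether ``download()`` expects the padded or the unpadded spelling is not stated here, so treat this as a sketch for your own bookkeeping.

```python
# Hypothetical helper (not part of datamule): normalize a CIK to the
# 10-digit zero-padded string form.

def normalize_cik(cik) -> str:
    """Accept an int or string CIK, with or without a 'CIK' prefix."""
    digits = str(cik).strip()
    if digits.upper().startswith("CIK"):
        digits = digits[3:]
    if not digits.isdigit() or len(digits) > 10:
        raise ValueError(f"not a valid CIK: {cik!r}")
    return digits.zfill(10)


if __name__ == "__main__":
    print(normalize_cik("1318605"))
```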

Working with XBRL Data
----------------------

Parse and analyze XBRL data:

.. code-block:: python

    from datamule import parse_company_concepts

    # Download company concepts
    downloader.download_company_concepts(ticker='AAPL')

    # Parse the data. Note: company_concepts is not defined in this snippet;
    # it is assumed to hold the downloaded company concepts data loaded into memory.
    tables = parse_company_concepts(company_concepts)

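To give a feel for what parsing this kind of data involves, here is a self-contained sketch that flattens nested XBRL facts into rows. It is not datamule's parser. It assumes the input follows the layout of the SEC's ``companyfacts`` JSON (``facts`` → taxonomy → concept → ``units`` → list of observations), and the numbers in ``SAMPLE`` are made up for illustration.

```python
# A sketch of flattening companyfacts-style JSON into flat rows.
# Assumption: facts -> taxonomy -> concept -> units -> list of observations.

def flatten_facts(company_facts):
    """Return one row per reported observation."""
    rows = []
    for taxonomy, concepts in company_facts.get("facts", {}).items():
        for concept, detail in concepts.items():
            for unit, observations in detail.get("units", {}).items():
                for obs in observations:
                    rows.append({
                        "taxonomy": taxonomy,
                        "concept": concept,
                        "unit": unit,
                        "end": obs.get("end"),
                        "value": obs.get("val"),
                        "form": obs.get("form"),
                    })
    return rows


# Made-up values, for illustration only.
SAMPLE = {
    "entityName": "Example Corp",
    "facts": {
        "us-gaap": {
            "Assets": {
                "label": "Assets",
                "units": {
                    "USD": [
                        {"end": "2022-12-31", "val": 100, "form": "10-K"},
                        {"end": "2023-12-31", "val": 120, "form": "10-K"},
                    ]
                },
            }
        }
    },
}

if __name__ == "__main__":
    for row in flatten_facts(SAMPLE):
        print(row)
```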

Using MuleBot
-------------

Set up a MuleBot instance:

.. code-block:: python

    from datamule.mulebot import MuleBot

    mulebot = MuleBot(openai_api_key="your-api-key")
    mulebot.run()

For more examples, check out our `GitHub repository <https://github.com/john-friedman/datamule-python/tree/main/examples>`_.
@@ -0,0 +1,33 @@
Welcome to datamule's documentation!
====================================

A Python package to work with SEC filings at scale. Also includes `Mulebot <https://chat.datamule.xyz/>`_, an open-source chatbot for SEC data that does not require storage. Integrated with `datamule <https://datamule.xyz/>`_'s APIs and datasets.

Features
--------

- Monitor EDGAR for new filings
- Parse textual filings into simplified HTML, interactive HTML, or structured JSON
- Download SEC filings quickly and easily
- Access datasets such as every 10-K, SIC codes, etc.
- Interact with SEC data using MuleBot

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   installation
   quickstart
   usage/index
   datasets
   examples
   known_issues
   changelog
   api/index

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
@@ -0,0 +1,30 @@
Installation
============

Basic Installation
------------------

To install the basic package:

.. code-block:: bash

    pip install datamule

Installation with Additional Features
-------------------------------------

To install with specific features (the quotes keep shells such as zsh from treating the square brackets as a glob pattern):

.. code-block:: bash

    pip install "datamule[filing_viewer]"  # Install with filing viewer module
    pip install "datamule[mulebot]"        # Install with MuleBot
    pip install "datamule[all]"            # Install all extras

Available Extras
----------------

- ``filing_viewer``: Includes dependencies for the filing viewer module
- ``mulebot``: Includes MuleBot for interacting with SEC data
- ``mulebot_server``: Includes Flask server for running MuleBot
- ``all``: Installs all available extras
@@ -0,0 +1,30 @@
Known Issues
============

SEC File Malformation
---------------------

Some SEC files are malformed, which can cause parsing errors. For example, this `Tesla Form D HTML from 2009 <https://www.sec.gov/Archives/edgar/data/1318605/000131860509000004/xslFormDX01/primary_doc.xml>`_ is missing a closing ``</meta>`` tag.

Workaround:

.. code-block:: python

    from lxml import etree

    # lxml's HTMLParser recovers from malformed markup such as unclosed tags,
    # where a strict XML parser would raise an error.
    with open('filings/000131860509000005primary_doc.xml', 'r', encoding='utf-8') as file:
        html = etree.parse(file, etree.HTMLParser())

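The difference between strict and lenient parsing can be reproduced with the standard library alone. The sketch below is not the lxml workaround itself: it swaps in ``xml.etree.ElementTree`` as the strict parser and ``html.parser`` as the lenient one, and ``MALFORMED`` is a made-up snippet with an unclosed ``<meta>`` tag rather than the actual SEC file.

```python
# Strict XML parsing rejects an unclosed <meta> tag; a lenient HTML parser
# keeps going. Standard-library stand-ins are used here instead of lxml.
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up markup with an unclosed <meta> tag (not the real filing).
MALFORMED = (
    '<html><head><meta charset="utf-8"><title>Form D</title></head>'
    "<body><p>Notice of Exempt Offering</p></body></html>"
)


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks of a document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strict_parse_fails(markup):
    """Return True when the strict XML parser rejects the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return True
    return False


def lenient_text(markup):
    """Extract text with the forgiving HTML parser."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return collector.chunks


if __name__ == "__main__":
    print(strict_parse_fails(MALFORMED))
    print(lenient_text(MALFORMED))
```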

Current Development Issues
--------------------------

* Documentation needed for filing and parser modules
* Need to add current names to former names
* Conductor needs more robustness with new options
* Need to add facet filters for forms
* SEC search engine implementation pending
* MuleBot custom HTML templates needed
* MuleBot summarization features and token usage protections needed
* Path compatibility needs verification on non-Windows devices
* Analytics implementation pending
* Download success message accuracy needs improvement
@@ -0,0 +1,19 @@
Quick Start
===========

Basic Usage
-----------

Here's a simple example to get you started:

.. code-block:: python

    import datamule as dm

    downloader = dm.Downloader()
    downloader.download(form='10-K', ticker='AAPL')

API Key
-------

Some datasets and features require an API key. [WIP]
@@ -0,0 +1,129 @@
Dataset Builder
===============

Transforms unstructured text data into structured datasets using the Gemini API. You can get a free API key from `Google AI Studio <https://aistudio.google.com/app/apikey>`_ with a 15 rpm limit. For higher rate limits, you can then set up the Google $300 Free Credit Trial for 90 days.

Requirements
------------

Input CSV must contain ``accession_number`` and ``text`` columns.

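Checking the header before a long run can save wasted API calls. The sketch below uses only the standard ``csv`` module. ``missing_columns`` is a hypothetical helper, not part of ``DatasetBuilder``, and the in-memory CSV stands in for a real input file.

```python
# Hypothetical pre-flight check (not part of DatasetBuilder): confirm the
# input CSV has the two columns named in the requirement above.
import csv
import io

REQUIRED_COLUMNS = ("accession_number", "text")


def missing_columns(csv_file):
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


if __name__ == "__main__":
    # Stand-in for open("data/item502.csv", newline="", encoding="utf-8")
    sample = io.StringIO("accession_number,body\n0001-23-000001,Some text\n")
    print(missing_columns(sample))
```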

Methods
-------

set_api_key(api_key)
    Sets Google Gemini API key for authentication.

set_paths(input_path, output_path, failed_path)
    Sets input CSV path, output path, and failed records log path.

set_base_prompt(prompt)
    Sets prompt template for Gemini API.

set_response_schema(schema)
    Sets expected JSON schema for validation.

set_model(model_name)
    Sets Gemini model (default: 'gemini-1.5-flash-8b').

set_rpm(rpm)
    Sets API rate limit in requests per minute (default: 1500). Lower this if your key's limit is below the default, such as the 15 rpm free-tier limit mentioned above.

set_save_frequency(frequency)
    Sets save interval in records (default: 100).

build()
    Processes input CSV and generates dataset.

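A requests-per-minute setting translates into a minimum gap between calls: 60 / rpm seconds. The sketch below illustrates that arithmetic with a simple pacer. It is not how ``DatasetBuilder`` enforces its limit internally (that is not documented here), and ``Pacer`` and ``min_interval_seconds`` are hypothetical names.

```python
# A sketch of rpm-based pacing; not DatasetBuilder's internal mechanism.
import time


def min_interval_seconds(rpm: int) -> float:
    """Smallest gap between requests that stays within `rpm` per minute."""
    if rpm <= 0:
        raise ValueError("rpm must be positive")
    return 60.0 / rpm


class Pacer:
    """Blocks in wait() so that calls are spaced at least one interval apart."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self._interval = min_interval_seconds(rpm)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval


if __name__ == "__main__":
    print(min_interval_seconds(15))    # free-tier limit from the introduction
    print(min_interval_seconds(1500))  # documented default
```

Calling ``Pacer(15).wait()`` before each request would space calls four seconds apart.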

Usage
-----

.. code-block:: python

    from datamule.dataset_builder.dataset_builder import DatasetBuilder
    import os

    builder = DatasetBuilder()

    # Set API key
    builder.set_api_key(os.environ["GOOGLE_API_KEY"])

    # Set required configurations
    builder.set_paths(
        input_path="data/item502.csv",
        output_path="data/bod.csv",
        failed_path="data/failed_accessions.txt"
    )

    builder.set_base_prompt("""Extract Director or Principal Officer info to JSON format.
    Provide the following information:
    - start_date (YYYYMMDD)
    - end_date (YYYYMMDD)
    - name (First Middle Last)
    - title
    Return null if info unavailable.""")

    builder.set_response_schema({
        "type": "ARRAY",
        "items": {
            "type": "OBJECT",
            "properties": {
                "start_date": {"type": "STRING", "description": "Start date in YYYYMMDD format"},
                "end_date": {"type": "STRING", "description": "End date in YYYYMMDD format"},
                "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
                "title": {"type": "STRING", "description": "Official title/position"}
            },
            "required": ["start_date", "end_date", "name", "title"]
        }
    })

    # Optional configurations
    builder.set_rpm(1500)
    builder.set_save_frequency(100)
    builder.set_model('gemini-1.5-flash-8b')

    # Build the dataset
    builder.build()

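Model output does not always match the requested schema, so a quick check of the generated records can be useful. The sketch below validates a JSON array against the four required fields and the ``YYYYMMDD`` date format from the example above. ``validate_records`` is a hypothetical helper, not part of ``DatasetBuilder``, and accepting ``null`` dates is an assumption based on the prompt's "Return null if info unavailable".

```python
# Hypothetical post-hoc check (not part of DatasetBuilder) for records shaped
# like the response schema in the usage example.
import json
from datetime import datetime

REQUIRED_FIELDS = ("start_date", "end_date", "name", "title")


def is_yyyymmdd(value):
    """True for None (assumed allowed) or a real calendar date as YYYYMMDD."""
    if value is None:
        return True
    if not (isinstance(value, str) and len(value) == 8 and value.isdigit()):
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False
    return True


def validate_records(raw_json):
    """Return a list of human-readable problems; empty means the data passed."""
    records = json.loads(raw_json)
    if not isinstance(records, list):
        return ["top-level value is not an array"]
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: not an object")
            continue
        missing = [key for key in REQUIRED_FIELDS if key not in record]
        if missing:
            problems.append(f"record {i}: missing {missing}")
            continue
        for key in ("start_date", "end_date"):
            if not is_yyyymmdd(record[key]):
                problems.append(f"record {i}: bad {key} {record[key]!r}")
    return problems


if __name__ == "__main__":
    sample = '[{"start_date": "20240115", "end_date": null, "name": "A B", "title": "CFO"}]'
    print(validate_records(sample))
```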

API Key Setup
-------------

1. Get API Key:
   Visit `Google AI Studio <https://aistudio.google.com/app/apikey>`_ to generate your API key.

2. Set API Key as Environment Variable:

   Windows (Command Prompt; ``setx`` only affects terminals opened afterwards):
   ::

       setx GOOGLE_API_KEY your-api-key

   Windows (PowerShell):
   ::

       [System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-api-key', 'User')

   macOS/Linux (bash):
   ::

       echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.bash_profile
       source ~/.bash_profile

   macOS (zsh):
   ::

       echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.zshrc
       source ~/.zshrc

Note: Replace 'your-api-key' with your actual API key.

Alternative API Key Setup
-------------------------

You can also set the API key directly in your Python code, though this is not recommended for production:

.. code-block:: python

    api_key = "your-api-key"  # Replace with your actual API key
    builder.set_api_key(api_key)

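If you read the key from the environment, a missing variable otherwise surfaces as a bare ``KeyError``. The sketch below wraps the lookup with a clearer message. ``read_api_key`` is a hypothetical helper, not part of datamule, and the ``environ`` parameter exists only so the function can be exercised without touching the real environment.

```python
# Hypothetical helper (not part of datamule): read the key from the
# environment and fail with an actionable message when it is absent.
import os


def read_api_key(env_var="GOOGLE_API_KEY", environ=None):
    """Return the stripped key, or raise RuntimeError if unset or blank."""
    env = os.environ if environ is None else environ
    key = env.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; see 'API Key Setup' for how to define it"
        )
    return key


if __name__ == "__main__":
    # A stand-in mapping is passed here so the example runs anywhere.
    print(read_api_key(environ={"GOOGLE_API_KEY": "your-api-key"}))
```

The returned value could then be passed to ``builder.set_api_key()``.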