Commit

Deploy Sphinx documentation 1527de2

github-actions[bot] committed Oct 28, 2024
0 parents commit 4735a89
Showing 96 changed files with 8,235 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
# Sphinx build info version 1
# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
config: cdf485b9f897c3c784b3534aab3b7c2a
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Binary file added _images/interactive.png
Binary file added _images/json.png
Binary file added _images/simplify.png
14 changes: 14 additions & 0 deletions _sources/api/index.rst.txt
API Reference
=============

.. toctree::
   :maxdepth: 2

   downloader
   parser
   filing_viewer
   mulebot

This section provides detailed API documentation for datamule's modules.

Note: This documentation is automatically generated from the source code.
72 changes: 72 additions & 0 deletions _sources/datasets.rst.txt
Datasets
========

Available Datasets
------------------

datamule provides access to several SEC datasets:

- **FTD Data** (fails-to-deliver; 1.3GB, ~60s to download)
* Every FTD since 2004
* ``dataset='ftd'``

- **10-Q Filings**
* Every 10-Q since 2001
* 500MB-3GB per year, ~5 minutes to download
* ``dataset='10q_2023'`` (replace year as needed)

- **10-K Filings**
* Every 10-K from 2001 to September 2024
* ``dataset='10k_2002'`` (replace year as needed)

- **13F-HR Information Tables**
* Every 13F-HR Information Table since 2013
* Updated to current date
* ``dataset='13f_information_table'``

- **MD&A Collection**
* 100,000 MD&As since 2001
* Requires free API key (beta)

Usage Example
-------------

.. code-block:: python

   import datamule as dm

   downloader = dm.Downloader()

   downloader.download_dataset(dataset='ftd')
   downloader.download_dataset(dataset='10q_2023')
   downloader.download_dataset(dataset='13f_information_table')

Notes
-----

* Bulk datasets may become out of date. Use ``download_dataset()`` + ``download()`` to fill gaps; see the sketch below.
* The ``13f_information_table`` dataset automatically implements gap-filling.
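
A minimal sketch of the gap-filling pattern, reusing calls shown elsewhere in these docs; adjust the form, year, and tickers to your needs:

.. code-block:: python

   import datamule as dm

   downloader = dm.Downloader()

   # Bulk snapshot first...
   downloader.download_dataset(dataset='10q_2023')

   # ...then top up with recent filings for the companies you care about.
   downloader.download(form='10-Q', ticker=['TSLA', 'META'])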

Package Data
------------

The package includes several useful CSV datasets:

- ``company_former_names.csv``: Former names of companies
- ``company_metadata.csv``: Metadata including SIC classification
- ``company_tickers.csv``: CIK, ticker, name mappings
- ``sec-glossary.csv``: Form types and descriptions
- ``xbrl_descriptions.csv``: Category fact descriptions

Usage Example
-------------

.. code-block:: python

   import pandas as pd

   from datamule import load_package_dataset

   company_tickers = pd.DataFrame(load_package_dataset('company_tickers'))

Updating Package Data
---------------------

You can update the package data using:

.. code-block:: python

   import datamule as dm

   downloader = dm.Downloader()

   downloader.update_company_tickers()
   downloader.update_metadata()
48 changes: 48 additions & 0 deletions _sources/examples.rst.txt
Examples
========

Basic Downloads
---------------

Download 10-K filings for specific companies:

.. code-block:: python

   import datamule as dm

   downloader = dm.Downloader()

   # Download by CIK
   downloader.download(form='10-K', cik='1318605')

   # Download by ticker
   downloader.download(form='10-K', ticker=['TSLA', 'META'])

Working with XBRL Data
----------------------

Parse and analyze XBRL data:

.. code-block:: python

   import datamule as dm
   from datamule import parse_company_concepts

   downloader = dm.Downloader()

   # Download company concepts (assumes the call returns the downloaded data)
   company_concepts = downloader.download_company_concepts(ticker='AAPL')

   # Parse the data
   tables = parse_company_concepts(company_concepts)

Using MuleBot
-------------

Set up a MuleBot instance:

.. code-block:: python

   from datamule.mulebot import MuleBot

   mulebot = MuleBot(openai_api_key="your-api-key")
   mulebot.run()

For more examples, check out our `GitHub repository <https://github.com/john-friedman/datamule-python/tree/main/examples>`_.
33 changes: 33 additions & 0 deletions _sources/index.rst.txt
Welcome to datamule's documentation!
====================================

A Python package for working with SEC filings at scale. Also includes `MuleBot <https://chat.datamule.xyz/>`_, an open-source chatbot for SEC data that requires no local storage. Integrated with `datamule <https://datamule.xyz/>`_'s APIs and datasets.

Features
--------

- Monitor EDGAR for new filings
- Parse textual filings into simplified HTML, interactive HTML, or structured JSON
- Download SEC filings quickly and easily
- Access datasets such as every 10-K, SIC codes, etc.
- Interact with SEC data using MuleBot

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   installation
   quickstart
   usage/index
   datasets
   examples
   known_issues
   changelog
   api/index

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
30 changes: 30 additions & 0 deletions _sources/installation.rst.txt
Installation
============

Basic Installation
------------------

To install the basic package:

.. code-block:: bash

   pip install datamule

Installation with Additional Features
-------------------------------------

To install with specific features:

.. code-block:: bash

   pip install datamule[filing_viewer]   # Install with filing viewer module
   pip install datamule[mulebot]         # Install with MuleBot
   pip install datamule[all]             # Install all extras

Available Extras
----------------

- ``filing_viewer``: Includes dependencies for the filing viewer module
- ``mulebot``: Includes MuleBot for interacting with SEC data
- ``mulebot_server``: Includes Flask server for running MuleBot
- ``all``: Installs all available extras
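
Extras can also be combined in a single install; quoting the requirement avoids shell globbing of the brackets (e.g., in zsh):

.. code-block:: bash

   pip install "datamule[filing_viewer,mulebot]"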
30 changes: 30 additions & 0 deletions _sources/known_issues.rst.txt
Known Issues
============

SEC File Malformation
---------------------

Some SEC files are malformed, which can cause parsing errors. For example, this `Tesla Form D HTML from 2009 <https://www.sec.gov/Archives/edgar/data/1318605/000131860509000004/xslFormDX01/primary_doc.xml>`_ is missing a closing ``</meta>`` tag.

Workaround:

.. code-block:: python

   from lxml import etree

   # etree.HTMLParser() is lenient and recovers from malformed markup,
   # such as the missing closing tag.
   with open('filings/000131860509000005primary_doc.xml', 'r', encoding='utf-8') as file:
       html = etree.parse(file, etree.HTMLParser())

Current Development Issues
--------------------------

* Documentation needed for filing and parser modules
* Need to add current names to former names
* Conductor needs more robustness with new options
* Need to add facet filters for forms
* SEC search engine implementation pending
* MuleBot custom HTML templates needed
* MuleBot summarization features and token usage protections needed
* Path compatibility needs verification on non-Windows devices
* Analytics implementation pending
* Download success message accuracy needs improvement
19 changes: 19 additions & 0 deletions _sources/quickstart.rst.txt
Quick Start
===========

Basic Usage
-----------

Here's a simple example to get you started:

.. code-block:: python

   import datamule as dm

   downloader = dm.Downloader()
   downloader.download(form='10-K', ticker='AAPL')

API Key
-------

Some datasets and features require an API key. [WIP]
129 changes: 129 additions & 0 deletions _sources/usage/dataset_builder.rst.txt
Dataset Builder
===============

Transforms unstructured text data into structured datasets using the Gemini API. You can get a free API key from `Google AI Studio <https://aistudio.google.com/app/apikey>`_, subject to a 15 RPM limit. For higher rate limits, you can then set up Google's 90-day, $300 free-credit trial.

Requirements
------------

Input CSV must contain ``accession_number`` and ``text`` columns.
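
For illustration, an input file with the required columns might be built like this (the accession number and text below are made up):

.. code-block:: python

   import pandas as pd

   # Hypothetical rows; real text would be the full filing excerpt.
   df = pd.DataFrame({
       "accession_number": ["0001234567-24-000001"],
       "text": ["On January 2, 2024, the board appointed Jane Q. Example as CFO."],
   })
   df.to_csv("data/item502.csv", index=False)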

Methods
-------

set_api_key(api_key)
    Sets the Google Gemini API key for authentication.

set_paths(input_path, output_path, failed_path)
    Sets the input CSV path, the output path, and the failed-records log path.

set_base_prompt(prompt)
    Sets the prompt template for the Gemini API.

set_response_schema(schema)
    Sets the expected JSON schema for validation.

set_model(model_name)
    Sets the Gemini model (default: ``'gemini-1.5-flash-8b'``).

set_rpm(rpm)
    Sets the API rate limit (default: 1500; lower this to 15 on the free tier).

set_save_frequency(frequency)
    Sets the save interval in records (default: 100).

build()
    Processes the input CSV and generates the dataset.

Usage
-----

.. code-block:: python

   import os

   from datamule.dataset_builder.dataset_builder import DatasetBuilder

   builder = DatasetBuilder()

   # Set API key
   builder.set_api_key(os.environ["GOOGLE_API_KEY"])

   # Set required configurations
   builder.set_paths(
       input_path="data/item502.csv",
       output_path="data/bod.csv",
       failed_path="data/failed_accessions.txt"
   )

   builder.set_base_prompt("""Extract Director or Principal Officer info to JSON format.
   Provide the following information:
   - start_date (YYYYMMDD)
   - end_date (YYYYMMDD)
   - name (First Middle Last)
   - title
   Return null if info unavailable.""")

   builder.set_response_schema({
       "type": "ARRAY",
       "items": {
           "type": "OBJECT",
           "properties": {
               "start_date": {"type": "STRING", "description": "Start date in YYYYMMDD format"},
               "end_date": {"type": "STRING", "description": "End date in YYYYMMDD format"},
               "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
               "title": {"type": "STRING", "description": "Official title/position"}
           },
           "required": ["start_date", "end_date", "name", "title"]
       }
   })

   # Optional configurations
   builder.set_rpm(1500)
   builder.set_save_frequency(100)
   builder.set_model('gemini-1.5-flash-8b')

   # Build the dataset
   builder.build()

API Key Setup
-------------

1. Get API Key:

   Visit `Google AI Studio <https://aistudio.google.com/app/apikey>`_ to generate your API key.

2. Set API Key as Environment Variable:

   Windows (Command Prompt)::

      setx GOOGLE_API_KEY your-api-key

   Windows (PowerShell)::

      [System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-api-key', 'User')

   macOS/Linux (bash)::

      echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.bash_profile
      source ~/.bash_profile

   macOS (zsh)::

      echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.zshrc
      source ~/.zshrc

Note: Replace 'your-api-key' with your actual API key.
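
You can confirm the variable is visible to Python before running the builder:

.. code-block:: python

   import os

   # Prints your key if the environment variable is set, otherwise None.
   print(os.environ.get("GOOGLE_API_KEY"))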


Alternative API Key Setup
-------------------------

You can also set the API key directly in your Python code, though this is not recommended for production:

.. code-block:: python

   api_key = "your-api-key"  # Replace with your actual API key
   builder.set_api_key(api_key)