Deploy Sphinx documentation 1527de2

john-friedman · Oct 28, 2024 · commit 4735a89
Showing 96 changed files with 8,235 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file records the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: cdf485b9f897c3c784b3534aab3b7c2a
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.nojekyll b/.nojekyll
diff --git a/_images/interactive.png b/_images/interactive.png
diff --git a/_images/json.png b/_images/json.png
diff --git a/_images/simplify.png b/_images/simplify.png
diff --git a/_sources/api/index.rst.txt b/_sources/api/index.rst.txt
@@ -0,0 +1,14 @@
+API Reference
+=============
+
+.. toctree::
+   :maxdepth: 2
+
+   downloader
+   parser
+   filing_viewer
+   mulebot
+
+This section provides detailed API documentation for datamule's modules.
+
+Note: This documentation is automatically generated from the source code.
diff --git a/_sources/datasets.rst.txt b/_sources/datasets.rst.txt
@@ -0,0 +1,72 @@
+Datasets
+========
+
+Available Datasets
+------------------
+
+datamule provides access to several SEC datasets:
+
+- **FTD Data** (1.3GB, ~60s to download)
+    * Every fails-to-deliver (FTD) record since 2004
+    * ``dataset='ftd'``
+
+- **10-Q Filings**
+    * Every 10-Q since 2001
+    * 500MB-3GB per year, ~5 minutes to download
+    * ``dataset='10q_2023'`` (replace year as needed)
+
+- **10-K Filings**
+    * Every 10-K from 2001 to September 2024
+    * ``dataset='10k_2002'`` (replace year as needed)
+
+- **13F-HR Information Tables**
+    * Every 13F-HR Information Table since 2013
+    * Updated to current date
+    * ``dataset='13f_information_table'``
+
+- **MD&A Collection**
+    * 100,000 MD&As since 2001
+    * Requires free API key (beta)
+
+Usage Example
+-------------
+
+.. code-block:: python
+
+    import datamule as dm
+
+    downloader = dm.Downloader()
+    downloader.download_dataset(dataset='ftd')
+    downloader.download_dataset(dataset='10q_2023')
+    downloader.download_dataset(dataset='13f_information_table')
+
+Notes
+-----
+
+* Bulk datasets may become out of date. Use ``download_dataset()`` to fetch the bulk data, then ``download()`` to fill any gaps.
+* The ``13f_information_table`` dataset fills gaps automatically.
+
+Package Data
+------------
+
+The package includes several useful CSV datasets:
+
+- ``company_former_names.csv``: Former names of companies
+- ``company_metadata.csv``: Metadata including SIC classification
+- ``company_tickers.csv``: CIK, ticker, name mappings
+- ``sec-glossary.csv``: Form types and descriptions
+- ``xbrl_descriptions.csv``: Category fact descriptions
+
+Usage example:
+
+.. code-block:: python
+
+    import pandas as pd
+    from datamule import load_package_dataset
+    company_tickers = pd.DataFrame(load_package_dataset('company_tickers'))
+
+Updating Package Data
+---------------------
+
+You can update the package data using:
+
+.. code-block:: python
+
+    downloader.update_company_tickers()
+    downloader.update_metadata()
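
Note on the yearly dataset names in datasets.rst: the 10-Q and 10-K datasets are named with a form prefix and a year (``10q_2023``, ``10k_2002``). A minimal sketch of generating those names for a range of years follows. The helper `yearly_dataset_names` is hypothetical and not part of datamule, and it assumes the `<prefix>_<year>` pattern holds for every year in the range.

```python
def yearly_dataset_names(prefix, start_year, end_year):
    """Build dataset names such as '10q_2023' for an inclusive year range.

    Hypothetical helper: it only formats strings and does not check
    whether a dataset actually exists for each year.
    """
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    return [f"{prefix}_{year}" for year in range(start_year, end_year + 1)]


names = yearly_dataset_names("10q", 2021, 2023)
print(names)  # ['10q_2021', '10q_2022', '10q_2023']
```

Each name could then be passed to ``downloader.download_dataset(dataset=name)``, as in the usage example in the file above.
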
diff --git a/_sources/examples.rst.txt b/_sources/examples.rst.txt
@@ -0,0 +1,48 @@
+Examples
+========
+
+Basic Downloads
+---------------
+
+Download 10-K filings for specific companies:
+
+.. code-block:: python
+
+    import datamule as dm
+    
+    downloader = dm.Downloader()
+    
+    # Download by CIK
+    downloader.download(form='10-K', cik='1318605')
+    
+    # Download by ticker
+    downloader.download(form='10-K', ticker=['TSLA', 'META'])
+
+Working with XBRL Data
+----------------------
+
+Parse and analyze XBRL data:
+
+.. code-block:: python
+
+    from datamule import parse_company_concepts
+    
+    # Download company concepts
+    downloader.download_company_concepts(ticker='AAPL')
+    
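+    # Parse the data. Note: company_concepts is not defined in this snippet;
+    # it must hold the downloaded company concepts data before this call.
+    tables = parse_company_concepts(company_concepts)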
+
+Using MuleBot
+-------------
+
+Set up a MuleBot instance:
+
+.. code-block:: python
+
+    from datamule.mulebot import MuleBot
+    
+    mulebot = MuleBot(openai_api_key="your-api-key")
+    mulebot.run()
+
+For more examples, check out our `GitHub repository <https://github.com/john-friedman/datamule-python/tree/main/examples>`_.
diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt
@@ -0,0 +1,33 @@
+Welcome to datamule's documentation!
+====================================
+
+A Python package for working with SEC filings at scale. It also includes `MuleBot <https://chat.datamule.xyz/>`_, an open-source chatbot for SEC data that does not require storage, and it integrates with `datamule <https://datamule.xyz/>`_'s APIs and datasets.
+
+Features
+--------
+
+- Monitor EDGAR for new filings
+- Parse textual filings into simplified HTML, interactive HTML, or structured JSON
+- Download SEC filings quickly and easily
+- Access datasets such as every 10-K, SIC codes, etc.
+- Interact with SEC data using MuleBot
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   installation
+   quickstart
+   usage/index
+   datasets
+   examples
+   known_issues
+   changelog
+   api/index
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/_sources/installation.rst.txt b/_sources/installation.rst.txt
@@ -0,0 +1,30 @@
+Installation
+============
+
+Basic Installation
+------------------
+
+To install the basic package:
+
+.. code-block:: bash
+
+   pip install datamule
+
+Installation with Additional Features
+-------------------------------------
+
+To install with specific features:
+
+.. code-block:: bash
+
+   pip install "datamule[filing_viewer]"  # Install with filing viewer module
+   pip install "datamule[mulebot]"        # Install with MuleBot
+   pip install "datamule[all]"            # Install all extras
+
+Available Extras
+----------------
+
+- ``filing_viewer``: Includes dependencies for the filing viewer module
+- ``mulebot``: Includes MuleBot for interacting with SEC data
+- ``mulebot_server``: Includes Flask server for running MuleBot
+- ``all``: Installs all available extras
diff --git a/_sources/known_issues.rst.txt b/_sources/known_issues.rst.txt
@@ -0,0 +1,30 @@
+Known Issues
+============
+
+SEC File Malformation
+---------------------
+
+Some SEC files are malformed, which can cause parsing errors. For example, this `Tesla Form D HTML from 2009 <https://www.sec.gov/Archives/edgar/data/1318605/000131860509000004/xslFormDX01/primary_doc.xml>`_ is missing a closing ``</meta>`` tag.
+
+Workaround: parse the file with lxml's lenient ``HTMLParser`` instead of a strict XML parser:
+
+.. code-block:: python
+
+    from lxml import etree
+
+    with open('filings/000131860509000005primary_doc.xml', 'r', encoding='utf-8') as file:
+        html = etree.parse(file, etree.HTMLParser())
+
+Current Development Issues
+--------------------------
+
+* Documentation needed for filing and parser modules
+* Need to add current names to former names
+* Conductor needs more robustness with new options
+* Need to add facet filters for forms
+* SEC search engine implementation pending
+* MuleBot custom HTML templates needed
+* MuleBot summarization features and token usage protections needed
+* Path compatibility needs verification on non-Windows devices
+* Analytics implementation pending
+* Download success message accuracy needs improvement
diff --git a/_sources/quickstart.rst.txt b/_sources/quickstart.rst.txt
@@ -0,0 +1,19 @@
+Quick Start
+===========
+
+Basic Usage
+-----------
+
+Here's a simple example to get you started:
+
+.. code-block:: python
+
+    import datamule as dm
+
+    downloader = dm.Downloader()
+    downloader.download(form='10-K', ticker='AAPL')
+
+API Key
+-------
+
+Some datasets and features require an API key. [WIP]
diff --git a/_sources/usage/dataset_builder.rst.txt b/_sources/usage/dataset_builder.rst.txt
@@ -0,0 +1,129 @@
+Dataset Builder
+===============
+
+Transforms unstructured text data into structured datasets using the Gemini API. You can get a free API key from `Google AI Studio <https://aistudio.google.com/app/apikey>`_ with a limit of 15 requests per minute (rpm). For higher rate limits, you can then set up the Google $300 Free Credit Trial for 90 days.
+
+Requirements
+------------
+
+Input CSV must contain ``accession_number`` and ``text`` columns.
+
+Methods
+-------
+
+set_api_key(api_key)
+    Sets Google Gemini API key for authentication.
+
+set_paths(input_path, output_path, failed_path)
+    Sets input CSV path, output path, and failed records log path.
+
+set_base_prompt(prompt)
+    Sets prompt template for Gemini API.
+
+set_response_schema(schema)
+    Sets expected JSON schema for validation.
+
+set_model(model_name)
+    Sets Gemini model (default: 'gemini-1.5-flash-8b').
+
+set_rpm(rpm)
+    Sets the API rate limit in requests per minute (default: 1500).
+
+set_save_frequency(frequency)
+    Sets save interval in records (default: 100).
+
+build()
+    Processes input CSV and generates dataset.
+
+Usage
+-----
+
+.. code-block:: python
+
+    from datamule.dataset_builder.dataset_builder import DatasetBuilder
+    import os
+
+    builder = DatasetBuilder()
+
+    # Set API key
+    builder.set_api_key(os.environ["GOOGLE_API_KEY"])
+
+    # Set required configurations
+    builder.set_paths(
+        input_path="data/item502.csv",
+        output_path="data/bod.csv",
+        failed_path="data/failed_accessions.txt"
+    )
+
+    builder.set_base_prompt("""Extract Director or Principal Officer info to JSON format. 
+    Provide the following information:
+    - start_date (YYYYMMDD)
+    - end_date (YYYYMMDD)
+    - name (First Middle Last)
+    - title
+    Return null if info unavailable.""")
+
+    builder.set_response_schema({
+        "type": "ARRAY",
+        "items": {
+            "type": "OBJECT",
+            "properties": {
+                "start_date": {"type": "STRING", "description": "Start date in YYYYMMDD format"},
+                "end_date": {"type": "STRING", "description": "End date in YYYYMMDD format"},
+                "name": {"type": "STRING", "description": "Full name (First Middle Last)"},
+                "title": {"type": "STRING", "description": "Official title/position"}
+            },
+            "required": ["start_date", "end_date", "name", "title"]
+        }
+    })
+
+    # Optional configurations
+    builder.set_rpm(1500)
+    builder.set_save_frequency(100)
+    builder.set_model('gemini-1.5-flash-8b')
+
+    # Build the dataset
+    builder.build()
+
+API Key Setup
+-------------
+
+1. Get API Key:
+   Visit `Google AI Studio <https://aistudio.google.com/app/apikey>`_ to generate your API key.
+
+2. Set API Key as Environment Variable:
+
+   Windows (Command Prompt):
+   ::
+
+       setx GOOGLE_API_KEY your-api-key
+
+   Windows (PowerShell):
+   ::
+
+       [System.Environment]::SetEnvironmentVariable('GOOGLE_API_KEY', 'your-api-key', 'User')
+
+   macOS/Linux (bash):
+   ::
+
+       echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.bash_profile
+       source ~/.bash_profile
+
+   macOS (zsh):
+   ::
+
+       echo 'export GOOGLE_API_KEY="your-api-key"' >> ~/.zshrc
+       source ~/.zshrc
+
+   Note: Replace ``your-api-key`` with your actual API key.
+
+
+Alternative API Key Setup
+-------------------------
+
+You can also set the API key directly in your Python code, though this is not recommended for production:
+
+.. code-block:: python
+
+    api_key = "your-api-key"  # Replace with your actual API key
+    builder.set_api_key(api_key)
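
Note on the input requirement in dataset_builder.rst: the input CSV must contain ``accession_number`` and ``text`` columns. A minimal sketch of checking a file's header before calling ``builder.build()`` follows. It uses only the standard library. The helper `missing_columns` is hypothetical and not part of datamule, and the sample rows are made-up placeholders.

```python
import csv
import io

# Column names required by the Dataset Builder, per its Requirements section.
REQUIRED_COLUMNS = {"accession_number", "text"}


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names absent from the CSV header, sorted.

    Hypothetical helper: csv_file is any open text file or file-like object.
    """
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(set(required) - header)


# Placeholder data standing in for a real input file.
good = io.StringIO("accession_number,text\nplaceholder-id,Some filing text\n")
bad = io.StringIO("accession,text\nplaceholder-id,Some filing text\n")

print(missing_columns(good))  # []
print(missing_columns(bad))   # ['accession_number']
```

With a real file, the same check could be run as ``missing_columns(open("data/item502.csv", newline="", encoding="utf-8"))`` before the build step.
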