
"asyncio.run() cannot be called from a running event loop" #23

Closed
debalee101 opened this issue Nov 24, 2024 · 8 comments


@debalee101

CIK.xlsx

I am trying to download 10Ks from a list of CIKs [Sample list attached]. I am using the following code:

!pip install datamule[all]

import pandas as pd
import datamule as dm

df_CIK = pd.read_excel('CIK.xlsx', sheet_name='Sheet1')
CIKlist = df_CIK['CIK'].tolist()

downloader = dm.Downloader()
downloader.set_limiter('www.sec.gov', 5)
downloader.set_limiter('efts.sec.gov', 5)

output_dir = '10-K'
metadata_csv = 'metadata.csv'

for CIK in CIKlist:
    try:
        print(f"Downloading 10-K forms for {CIK}...")
        downloader.download(
            form='10-K',
            cik=CIK,
            output_dir=output_dir,
            date=('2004-01-01', '2024-01-31'),
            save_metadata=True
        )
        print(f"Completed downloading for {CIK}")
    except Exception as e:
        print(f"Failed to download for {CIK}: {e}")

I have also tried:

async def download_10k():
    for CIK in CIKlist:
        try:
            print(f"Downloading 10-K forms for {CIK}...")
            await downloader.download(
                form='10-K',
                cik=CIK,
                output_dir=output_dir,
                date=('2004-01-01', '2024-01-31'),
                save_metadata=True
            )
            print(f"Completed downloading for {CIK}")
        except Exception as e:
            print(f"Failed to download for {CIK}: {e}")
try:
    loop = asyncio.get_running_loop()
    task = loop.create_task(download_10k())
    await task
except RuntimeError:
    asyncio.run(download_10k())

Or is it a good idea to download all available 10Ks within the date range, and then filter by the list of CIKs?

@john-friedman
Owner

You can pass your ciklist directly into the downloader.

downloader.download(
    form='10-K',
    cik=CIKlist,
    output_dir=output_dir,
    date=('2004-01-01', '2024-01-31'),
    save_metadata=True)

Not sure whether the asyncio error is coming from your first code block or the second. If it's the second, the downloader is already async and meant to work out of the box, which is why the error would occur. If it's the first, I'm not sure why; I'd love to know if you're using Jupyter notebooks / Colab.
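For reference, this error comes from calling asyncio.run() while an event loop is already running, which is what Jupyter and Colab do behind the scenes. A minimal sketch of the usual detection pattern (job and run_job are illustrative names, not part of datamule):

```python
import asyncio

async def job():
    # stand-in for an async download call
    return "done"

def run_job():
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # plain script: no loop is running, so asyncio.run() is safe
        return asyncio.run(job())
    # a loop is already running (e.g. Jupyter/Colab): schedule the
    # coroutine on it instead of calling asyncio.run(), which would
    # raise "asyncio.run() cannot be called from a running event loop"
    return asyncio.ensure_future(job())

print(run_job())
```

In a plain script this prints "done"; in a notebook it returns a Task scheduled on the notebook's loop.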

@debalee101
Author

I added a block that solved the error:

!pip install nest_asyncio
import nest_asyncio
nest_asyncio.apply()
import asyncio

I have a question:

Does 'filing' not work when the file is .htm? I have tried the following:

dfs = []

for file in Path(output_dir).iterdir():
    print(file)

for file in Path(output_dir).iterdir():
    filing = Filing(str(file), '10-K')
    dfs.append(pd.DataFrame(filing))

df = pd.concat(dfs)
df.to_json('10k.json')

I get a 'document' error. I am not able to figure out what I am doing wrong here.

@john-friedman
Owner

Hi @debalee101. Please share what you are running your code in. nest_asyncio is enabled by default for jupyter notebooks for this package, so if it's not working for you by default I would like to fix it.

Also, nice catch! The error is caused by 10-K/A files being mixed in with your 10-Ks, because the downloader downloads root forms by default.

Try emptying your directory and re-running the downloader with file_types=['10-K']. (Works on my machine.)
downloader.download(ticker=['MSFT','TSLA','AAPL'],output_dir=output_dir,form='10-K',file_types=['10-K'])

Note: the code's behavior is annoying and I will be making it simpler in future versions, so this is very helpful.

@debalee101
Author

Hey @john-friedman, I was running my code in Colab and found that nest_asyncio isn't enabled by default.
Thanks for the file_types suggestion; it worked for me.
I tried to use parse/Filing, but I couldn't figure out how to parse my 10-Ks and store the section I need as a list with a CIK-year identifier. parsed = parsed_data['document']['part1']['item1a'] extracts item1a for me, but I can't tie it to a CIK-year.
I didn't find this in the documentation. I need to extract parsed = parsed_data['document']['part1']['item1a'] either as a separate JSON for each CIK-year, or as a dict of lists.

@john-friedman
Owner

Great, I've added a note to enable this for Colab by default in the next update.

Currently, to get the CIK, you have to use the downloader's save_metadata option:

downloader.download(
    form='10-K',
    cik=CIKlist,
    output_dir=output_dir,
    date=('2004-01-01', '2024-01-31'),
    save_metadata=True)

which downloads metadata to metadata.jsonl; here is an excerpt:
[screenshot of a metadata.jsonl excerpt]

You can then use the metadata from its filepath or save it to a csv. e.g.

downloader.load_metadata(output_dir)
downloader.save_metadata_to_csv('metadata.csv')

I'm not happy with the current implementation, and will be changing it after I finish hosting my own archives. In the future the code should look more like this:

downloader.download(form='10-K',output_dir='10-K')

submission_dirs = os.listdir('10-K')
for dir in submission_dirs:
    submission = Submission(dir)
    for document in submission.document_type('10-K'):
        document.parse()
        item1a = document['document']['part1']['item1a']
        cik = submission.cik
        # your code to save cik and item1a to csv
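The final comment in the sketch above could be filled in with the stdlib csv module, for example (the rows list and column names are illustrative):

```python
import csv

# Illustrative (cik, item1a) pairs as gathered in the loop sketched above
rows = [("0000785787", "ITEM 1A. RISK FACTORS ...")]

# Write one row per filing, with a header, to a CSV file
with open("item1a.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["cik", "item1a"])
    writer.writerows(rows)
```

newline="" is the documented way to open a file for the csv module, so that quoted multi-line section text survives a round trip.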

@debalee101
Author

As I try to extract Item 1A, I have noticed that whenever a text of "Item *" is encountered, the next item is considered to have started. [I have attached an image: the extracted Item 1A stops at the word "See" in this context.]

[screenshot attached]

This is what my list looks like:

    "000119312511008295_d10k.htm": {
        "ciks": [
            "0000785787"
        ],
        "period_ending": "2010-10-31",
        "file_date": "2011-01-14",
        "form": "10-K",
        "biz_states": [
            "NJ"
        ],
        "item1a": "ITEM 1A.\n\nRISK FACTORS\n\nYou should\ncarefully consider each of the risks and uncertainties described below and elsewhere in this Annual Report on Form 10-K, as well as any amendments or updates reflected in subsequent filings with the SEC. We believe these risks and\nuncertainties, individually or in the aggregate, could cause our actual results to differ materially from expected and historical results and could materially and adversely affect our business operations, results of operations, financial condition\nand liquidity. Further, additional risks and uncertainties not presently known to us or that we currently deem immaterial may also impair our results and business operations.\n\nIndustry Risks\n\nOur business is dependent on the price and availability of\nresin, our principal raw material, and our ability to pass on resin price increases to our customers.\n\nThe principal\nraw materials that we use in our products are polyethylene, polypropylene and polyvinyl chloride resins. Our ability to operate profitably is dependent, in a large part, on the markets for these resins. We use resins that are derived from petroleum\nand natural gas, and therefore prices of such resins fluctuate substantially as a result of changes in petroleum and natural gas prices, demand and the capacity of resin suppliers. Instability in the world markets for petroleum and natural gas could\nadversely affect the prices of our raw materials and their general availability. Over the past several years, we have at times experienced significant fluctuations in resin prices and availability.\n\nOur ability to maintain profitability is heavily dependent upon our ability to pass through to our customers the full amount of any\nincrease in raw material costs. Since resin costs fluctuate significantly, selling prices are generally determined as a \"spread\" over resin costs, usually expressed as cents per pound. 
The historical increases and decreases in resin costs\nhave generally been reflected over a period of time in the sales prices of the products on a penny-for-penny basis. Assuming a constant volume of sales, an increase in resin costs would, therefore, result in increased sales revenues but lower gross\nprofit as a percentage of sales or gross profit margin, while a decrease in resin costs would result in lower sales revenues with a higher gross profit margin. Further, the gap between the time at which an order is taken, resin is purchased,\nproduction occurs and shipment is made, has an impact on our financial results and our working capital needs. In a period of rising resin prices, this impact is generally negative to operating results and in periods of declining resin prices, the\nimpact is generally positive to operating results. If there is overcapacity in the production of any specific product that we manufacture and sell, we frequently are not able to pass through the full amount of any cost increase.\n\n11\n\nTable of Contents\n\nEconomic conditions in the United States during fiscal 2010 were difficult with global economic and financial markets experiencing substantial disruption, which also increased the difficulty in\npassing through the full amount of cost increases. If resin prices increase and we are not able to fully pass on the increases to our customers, our results of operations, financial condition and liquidity will be adversely affected. See"

CODE:

!pip install datamule[all]

import pandas as pd
import datamule as dm

import json

from datamule import Downloader, Filing
from pathlib import Path
import shutil

!pip install nest_asyncio
import nest_asyncio
nest_asyncio.apply()
import asyncio
from datamule import load_package_dataset


df_CIK = pd.read_excel('CIK5.xlsx', sheet_name='Sheet1')
CIKlist = df_CIK['CIK'].tolist()

downloader = dm.Downloader()
downloader.set_limiter('www.sec.gov', 5)
downloader.set_limiter('efts.sec.gov', 5)

output_dir = '10-K'
metadata_csv = 'metadata.csv'

downloader.download(
    form='10-K',
    file_types=['10-K'],
    cik=CIKlist,
    output_dir=output_dir,
    date=('2004-01-01', '2024-01-31'),
    save_metadata=True)

downloader.load_metadata(output_dir)
downloader.save_metadata_to_csv('metadata.csv')

metadata_file = Path('/content/10-K/metadata.jsonl')
metadata_dict = {}

with open(metadata_file, 'r') as f:
    for line in f:
        data = json.loads(line.strip())
        metadata_dict.update(data)

filtered_dir = Path("filtered")

# DICTIONARY to store metadata and Item 1A

parsed_metadata = {}

for file in filtered_dir.iterdir():
    if file.suffix in ['.txt', '.htm']:
        try:
            # PARSE
            filing = Filing(str(file), '10-K')
            parsed_data = filing.parse_filing()

            # EXTRACT Item 1A
            item1a_content = parsed_data.get('document', {}).get('part1', {}).get('item1a', None)

            # EXTRACT the key from the file name
            # File name format: filtered/000119312510006494_d10k.htm
            file_key = file.stem.split('_')[0]  # Extract "000119312510006494"

            if file_key in metadata_dict:
                # COMBINE (metadata and Item 1A)
                metadata_entry = metadata_dict[file_key]['_source']
                parsed_metadata[file.name] = {
                    'metadata': metadata_entry,
                    'item1a': item1a_content
                }
            else:
                print(f"Metadata not found for file: {file.name}")

        except Exception as e:
            print(f"Error processing file {file}: {e}")

# JSON OUTPUT

output_file = Path("parsed_metadata_with_item1a.json")
with open(output_file, 'w') as f:
    json.dump(parsed_metadata, f, indent=4)

print(f"Parsed metadata with Item 1A saved to {output_file}")

#RESTRUCTURE

input_file = Path("parsed_metadata_with_item1a.json")
output_file = Path("formatted_metadata.json")

with open(input_file, 'r') as f:
    parsed_metadata = json.load(f)

formatted_metadata = {}
for file_name, data in parsed_metadata.items():
    formatted_metadata[file_name] = {
        "ciks": data['metadata'].get('ciks', []),
        "period_ending": data['metadata'].get('period_ending', None),
        "file_date": data['metadata'].get('file_date', None),
        "form": data['metadata'].get('form', None),
        "biz_states": data['metadata'].get('biz_states', []),
        "item1a": data.get('item1a', None)
    }

with open(output_file, 'w') as f:
    json.dump(formatted_metadata, f, indent=4)

print(f"Formatted metadata saved to {output_file}")
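For the "dict of lists" layout asked about earlier, one stdlib-only way to reshape the formatted records, keyed by a CIK-year identifier built from file_date, is sketched below (the sample record is illustrative, trimmed from the excerpt above):

```python
# Illustrative formatted_metadata record, as produced by the restructure step
formatted_metadata = {
    "000119312511008295_d10k.htm": {
        "ciks": ["0000785787"],
        "file_date": "2011-01-14",
        "item1a": "ITEM 1A. RISK FACTORS ...",
    }
}

# Group item1a texts into lists keyed by "CIK-year"
by_cik_year = {}
for file_name, rec in formatted_metadata.items():
    year = (rec.get("file_date") or "")[:4]  # year from the filing date
    for cik in rec.get("ciks", []):
        by_cik_year.setdefault(f"{cik}-{year}", []).append(rec.get("item1a"))

# by_cik_year == {"0000785787-2011": ["ITEM 1A. RISK FACTORS ..."]}
```

A filing with multiple CIKs lands under each of its CIK-year keys, and two filings from the same company and year append to the same list.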

@john-friedman
Owner

Strange, I thought I fixed that bug last month. I'll look into it, thanks.

@john-friedman
Owner

Should be fixed now. Sorry it took so long; I've been setting up a lot of other stuff!

On that note: I've probably introduced some new bugs. That's unavoidable, since I'm switching to a new parsing system that will let the package parse any type of attachment (e.g. XML, HTML, PDF, TXT). I'm hoping they're minor, but if they're not, let me know and I'll make fixing them a priority.
