feat: async methods to speed up api.get_dataframe #73

Draft
wants to merge 5 commits into
base: main

Collaborator

@sg-s sg-s commented Aug 23, 2024

problem description

api.get_dataframe (and therefore DataFrame.from_deeporigin) can take extremely long because:

  1. the 2 calls to list_database_rows and describe_row take place in series when they could take place in parallel (roughly 700ms in series vs 300ms in parallel)
  2. (more serious problem) there is a place where N DescribeFile calls are made in series, where N is the number of files in the database. this means that the time it takes to fetch data grows linearly with the number of files, at ~350ms per file resolution.
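
To put rough numbers on the two problems, here is a back-of-the-envelope estimate. It uses only the figures quoted above (about 700ms for the two calls in series and about 350ms per file). The function name and the example file counts are illustrative and not part of the codebase.

```python
# Rough latency model for the serial code path described above.
# The two constants come from the PR description; everything else
# (names, example file counts) is hypothetical.

TWO_CALLS_SERIAL_S = 0.7  # list_database_rows + describe_row, in series
PER_FILE_S = 0.35         # one DescribeFile call


def estimated_serial_seconds(n_files: int) -> float:
    """Estimate total fetch time when every call is made in series."""
    return TWO_CALLS_SERIAL_S + n_files * PER_FILE_S


if __name__ == "__main__":
    for n in (1, 10, 25):
        print(f"{n:>3} files: ~{estimated_serial_seconds(n):.2f}s")
```

The per-file term dominates quickly, which is why problem 2 is the more serious one.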

solution

  • parallelize the 2 calls in problem 1
  • parallelize calls to DescribeFile

technical discussion and changes

  • refactor code so that DescribeFile calls are isolated in a tight loop
  • write helper functions to create a configured async client
  • convert that tight loop into a parallel loop using async/await
  • use asyncio to bundle the 2 calls in problem 1 together
  • remove a semi-circular import where _api imports pandas and api
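
A minimal sketch of the approach in the list above, using `asyncio.gather` both to bundle the two calls from problem 1 and to turn the tight DescribeFile loop into a parallel one. The three coroutines here are hypothetical stand-ins that simulate latency with `asyncio.sleep`; they are not the real client methods, and the return shapes are assumptions.

```python
import asyncio


# Hypothetical stand-ins for the real API calls. The sleeps simulate
# network latency (scaled down so the example runs quickly).
async def list_database_rows(database_id: str) -> list:
    await asyncio.sleep(0.03)
    return [
        {"id": "row-1", "file_id": "file-1"},
        {"id": "row-2", "file_id": "file-2"},
    ]


async def describe_row(row_id: str) -> dict:
    await asyncio.sleep(0.03)
    return {"id": row_id, "name": "demo database"}


async def describe_file(file_id: str) -> dict:
    await asyncio.sleep(0.035)
    return {"id": file_id, "name": f"{file_id}.csv"}


async def fetch(database_id: str):
    # Problem 1: the two calls are independent, so run them concurrently.
    rows, db = await asyncio.gather(
        list_database_rows(database_id),
        describe_row(database_id),
    )

    # Problem 2: collect the file IDs first, then issue all
    # DescribeFile calls concurrently instead of one at a time.
    file_ids = [row["file_id"] for row in rows]
    files = await asyncio.gather(*(describe_file(f) for f in file_ids))

    # Map file IDs to names for use when building the dataframe.
    names = {f["id"]: f["name"] for f in files}
    return db, rows, names


if __name__ == "__main__":
    db, rows, names = asyncio.run(fetch("db-1"))
    print(names)
```

With many files, an unbounded `gather` sends every request at once. Wrapping `describe_file` in an `asyncio.Semaphore` is one way to cap concurrency if the server rate-limits.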

@sg-s sg-s self-assigned this Aug 23, 2024
@sg-s sg-s changed the title from "refactor: moved the mapping of file ids to names outside core loop" to "feat: async methods to speed up api.get_dataframe" Aug 23, 2024
sg-s added a commit that referenced this pull request Aug 31, 2024
## problem description

`api.get_dataframe` (and therefore `DataFrame.from_deeporigin`) can take
extremely long because:

1. the 2 calls to `list_database_rows` and `describe_row` take place in
series when they could take place in parallel (roughly 700ms in series
vs 300ms in parallel)
2. (more serious problem) there is a place where N `DescribeFile` calls
are made in series, where N is the number of files in the database. this
means that the time it takes to fetch data grows linearly with the
number of files, at ~350ms per file resolution.

## proposed solution in this PR

using a new `ListFiles` endpoint that allows us to resolve all file IDs
and names in a single API call (which typically takes 300ms no matter
how many files)
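
a rough sketch of the idea, assuming the `ListFiles` response can be treated as a list of records with `id` and `name` fields. that shape, the stubbed `list_files` function, and the `file` column are all assumptions for illustration, not the actual client API:

```python
# Hypothetical stand-in for a single ListFiles call. The real
# endpoint's response fields may differ.
def list_files() -> list:
    return [
        {"id": "file-1", "name": "reads.fastq"},
        {"id": "file-2", "name": "results.csv"},
    ]


def build_file_name_map(files: list) -> dict:
    """Build an ID -> name lookup from one batched response."""
    return {f["id"]: f["name"] for f in files}


def resolve_file_names(rows: list, name_by_id: dict) -> list:
    """Replace file IDs in each row with file names.

    IDs missing from the lookup are left unchanged.
    """
    return [
        {**row, "file": name_by_id.get(row["file"], row["file"])}
        for row in rows
    ]


if __name__ == "__main__":
    name_by_id = build_file_name_map(list_files())  # one API call
    rows = [{"id": "row-1", "file": "file-2"}]
    print(resolve_file_names(rows, name_by_id))
```

the lookup is built once, so resolving names costs one request plus a dictionary access per row instead of one request per file.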

### screenshot

from cold start, this is how long it takes to display a dataframe. in
the old version, this took ~10s

<img width="699" alt="Screenshot 2024-08-30 at 8 52 19 AM"
src="https://github.com/user-attachments/assets/71b16e2c-ba1c-425a-a8ec-2fd7dce9bf9f">



## rejected alternative

using async methods. this approach was tried in this PR:
#73

it was rejected because it's much slower, and complicates testing
dramatically