feat: async methods to speed up api.get_dataframe #73

sg-s · 2024-08-23T01:50:55Z

problem description

api.get_dataframe (and therefore DataFrame.from_deeporigin) can take extremely long because:

1. 2 calls, to list_database_rows and describe_row, take place in series when they could take place in parallel (about 700ms in series vs 300ms in parallel)
2. (more serious problem) there is a place where N DescribeFile calls are made in series, where N is the number of files in the database. this means the time it takes to fetch data grows linearly with the number of files, at ~350ms per file resolution.
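To make the linear growth concrete, here is a back-of-the-envelope sketch. It assumes only the ~350ms per-file figure quoted above; the function name `serial_resolution_ms` is hypothetical and not part of the library.

```python
# Rough cost model for resolving file names one DescribeFile call at a time.
# PER_FILE_MS is the approximate per-file latency quoted in the description.
PER_FILE_MS = 350


def serial_resolution_ms(n_files: int, per_file_ms: int = PER_FILE_MS) -> int:
    """Estimated total time (ms) when every file is resolved in series."""
    return n_files * per_file_ms


# Print the estimate for a few database sizes.
for n in (1, 10, 30):
    print(f"{n:>3} files -> ~{serial_resolution_ms(n)} ms")
```

Under this model the file-resolution step alone costs several seconds once a database holds a few dozen files.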

solution

1. parallelize the 2 calls in problem 1
2. parallelize calls to DescribeFile

technical discussion and changes

- refactor code so that DescribeFile calls are isolated in a tight loop
- write helper functions to create a configured async client
- convert that tight loop into a parallel loop using async/await
- use asyncio to bundle the 2 calls in problem 1 together
- remove a semi-circular import where _api imports pandas and api
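The changes listed above can be sketched roughly as follows. This is not the code from this PR: the three coroutines are hypothetical stand-ins that simulate network latency with `asyncio.sleep` (scaled down from the real figures), and their return shapes are assumptions. The sketch only shows the pattern of replacing a serial loop with `asyncio.gather`.

```python
import asyncio
import time


# Hypothetical stand-ins for the real API calls. Latencies are simulated
# and scaled down; return shapes are assumptions for illustration.
async def describe_file(file_id: str) -> dict:
    await asyncio.sleep(0.05)
    return {"id": file_id, "name": f"{file_id}.dat"}


async def list_database_rows(database_id: str) -> list:
    await asyncio.sleep(0.03)
    return [{"id": "row-1"}, {"id": "row-2"}]


async def describe_row(database_id: str) -> dict:
    await asyncio.sleep(0.03)
    return {"id": database_id, "cols": []}


async def resolve_serial(file_ids):
    # One call at a time: total time grows linearly with len(file_ids).
    return [await describe_file(f) for f in file_ids]


async def resolve_parallel(file_ids):
    # All calls in flight at once; gather preserves input order.
    return await asyncio.gather(*(describe_file(f) for f in file_ids))


async def fetch_rows_and_info(database_id: str):
    # Bundle the two independent calls from problem 1 together.
    return await asyncio.gather(
        list_database_rows(database_id), describe_row(database_id)
    )


def timed(coro_fn, arg):
    # Run a coroutine function to completion and measure wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro_fn(arg))
    return result, time.perf_counter() - start


file_ids = [f"file-{i}" for i in range(10)]
serial_result, serial_s = timed(resolve_serial, file_ids)
parallel_result, parallel_s = timed(resolve_parallel, file_ids)
(rows, info), bundled_s = timed(fetch_rows_and_info, "db-1")
print(f"serial: {serial_s:.2f}s, parallel: {parallel_s:.2f}s")
```

With simulated latency the parallel version finishes in roughly the time of a single call. Real gains depend on the server and on how the async client is configured.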

## problem description

`api.get_dataframe` (and therefore `DataFrame.from_deeporigin`) can take extremely long because:

1. 2 calls to `list_database_rows` and `describe_row` take place in series when they could take place in parallel (this is a problem of 700ms vs 300ms)
2. (more serious problem) there is a place where N `DescribeFile` calls are made in series, where N is the number of files in the database. this means that the time it takes to fetch data grows with the number of files linearly, and the time for each file resolution is ~350ms.

## proposed solution in this PR

using a new `ListFiles` endpoint that allows us to resolve all file IDs and names in a single API call (which typically takes 300ms no matter how many files)

### screenshot

from cold start, this is how long it takes to display a dataframe. in the old version, this took ~10s

<img width="699" alt="Screenshot 2024-08-30 at 8 52 19 AM" src="https://github.com/user-attachments/assets/71b16e2c-ba1c-425a-a8ec-2fd7dce9bf9f">

## rejected alternative

using async methods. this approach was tried in this PR: #73

it was rejected because it's much slower, and complicates testing dramatically
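As a rough illustration of the `ListFiles` idea: one call returns every file record, and names are then resolved locally with a dictionary lookup instead of one `DescribeFile` call per file. The `list_files` stub, its response shape, and `resolve_file_names` are all hypothetical; the real endpoint's fields may differ.

```python
def list_files() -> list:
    """Hypothetical stub for a single ListFiles call.

    The response shape (a list of dicts with "id" and "name") is an
    assumption for illustration only.
    """
    return [
        {"id": "_file:abc", "name": "reads.fastq"},
        {"id": "_file:def", "name": "ligand.sdf"},
    ]


def resolve_file_names(file_ids, files):
    # Build the id -> name mapping once, then look up locally.
    lookup = {f["id"]: f["name"] for f in files}
    # Unknown IDs are passed through unchanged rather than raising.
    return [lookup.get(file_id, file_id) for file_id in file_ids]


files = list_files()  # one API call, regardless of how many files exist
names = resolve_file_names(["_file:def", "_file:abc", "_file:zzz"], files)
print(names)
```

The number of network round trips stays at one however many file IDs appear in the dataframe, which is where the constant-time behaviour comes from.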

refactor: moved the mapping of file ids to names outside core loop

b09de29

sg-s self-assigned this Aug 23, 2024

sg-s changed the title ~~refactor: moved the mapping of file ids to names outside core loop~~ feat: async methods to speed up api.get_dataframe Aug 23, 2024

sg-s added 4 commits August 23, 2024 08:37

fix: ready to switch to async

3ca75f3

feat: working proof of concept of asyncio code

19f0bba

fix: tests pass local

6d9c028

fix: added describe files

630e66e

sg-s mentioned this pull request Aug 30, 2024

feat: much faster file name resolution using ListFiles #75

Merged
