Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Offer public API to get dataset info? #926

Open
SajidAlamQB opened this issue Nov 7, 2024 · 8 comments
Open

Offer public API to get dataset info? #926

SajidAlamQB opened this issue Nov 7, 2024 · 8 comments

Comments

@SajidAlamQB
Copy link
Contributor

Description

Related to: kedro-org/kedro-viz#1893 (comment)

When trying to retrieve metadata (like file size) from datasets in Kedro Viz we are running into issues, specifically within our DatasetStatsHook. Currently, to obtain this metadata, we need to access private attributes (e.g., _filepath, _fs, _protocol), which is causing inconsistency and compatibility problems with datasets that do not expose these attributes.

Context

By having a standardised, public way to retrieve dataset metadata we can improve best practices. This benefits users who want to collect dataset metadata or build plugins, as it provides a consistent and maintainable approach.

Possible Implementation

Perhaps adding an optional, public method (e.g., get_info) to datasets that can provide metadata like file size, etc.

@MinuraPunchihewa
Copy link
Contributor

Hey @SajidAlamQB,
I am happy to work on this. Can you provide some information on what kind of metadata would need to be retrieved exactly? Is there a list?

@SajidAlamQB
Copy link
Contributor Author

Hi @MinuraPunchihewa,

Thank you for your response and for offering to work on this!

I think some of the metadata we're Interested in is:

  • File Size (file_size): The size of the dataset file on disk.
  • Number of Rows (num_rows): For tabular datasets (e.g., CSV, Parquet, SQL tables), the total number of rows.
  • Number of Columns (num_columns): For tabular datasets, the total number of columns.

@astrojuanlu did you have any thoughts on what else?

@astrojuanlu
Copy link
Member

paging @BielStela since you opened kedro-org/kedro-viz#1714 :)

before proceeding with get_info though (which was a strawman proposal), given that there's a AbstractDataset._describe, should we actually have a describe for this purpose, and reuse what's already being returned? Looking at a random dataset, we have a bunch of properties there already

def _describe(self):
return {
"filepath": self._filepath,
"protocol": self._protocol,
"load_args": self._load_args,
"save_args": self._save_args,
"version": self._version,
}

@MinuraPunchihewa
Copy link
Contributor

@astrojuanlu This makes sense to me. I can reuse this and add any extra metadata that would be required. Is there anything outside of the list that @SajidAlamQB has given above?

@astrojuanlu
Copy link
Member

Would love to know @merelcht and @rashidakanchwala opinion before proceeding

@astrojuanlu
Copy link
Member

(Specifically, whether we should add a dedicated public get_info or if it should be describe and wrap _describe)

@rashidakanchwala
Copy link
Contributor

rashidakanchwala commented Nov 18, 2024

I missed this but I love this! It would really help Kedro-viz; though we need to decide what it means for each dataset. For instance, for SQLDatasets, could it get_info provide SQL Query etc.

@astrojuanlu
Copy link
Member

astrojuanlu commented Nov 18, 2024

We discussed this in backlog grooming and we acknowledge that the original problem (that Viz needs private data from datasets kedro-org/kedro-viz#1893 (comment)) exists and we want to address it.

There was past discussion about _describe when we implemented preview and it was discarded #504 (comment)

We didn't feel like adding more Viz-specific public methods, although an exception was made for said preview method since it wasn't part of AbstractDataset #504 (comment)

This problem was already anticipated when implementing the preview functionality kedro-org/kedro-viz#662 (comment) and re-discovered by a user some months later kedro-org/kedro-viz#1714

Renaming this issue so it's more clear that this is an open problem.

@astrojuanlu astrojuanlu changed the title Add an Optional get_info Method to Datasets for Metadata Retrieval Offer public API to get dataset info? Nov 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

No branches or pull requests

4 participants