Incorrect credentials handling in pandas.GBQTableDataset and pandas.GBQQueryDataset #975

Open
abhi8893 opened this issue Jan 4, 2025 · 0 comments
Labels
Community Issue/PR opened by the open-source community

abhi8893 commented Jan 4, 2025

Description

The credentials handling in pandas.GBQTableDataset and pandas.GBQQueryDataset is incorrect in 2 ways:

  1. The credentials type passed to underlying pandas_gbq.read_gbq is incorrect.

The credentials should be of type google.auth.credentials.Credentials, but the parameter is incorrectly annotated as google.oauth2.credentials.Credentials.

From https://googleapis.dev/python/pandas-gbq/latest/api.html#pandas_gbq.read_gbq

credentials ([google.auth.credentials.Credentials](https://googleapis.dev/python/google-auth/latest/reference/google.auth.credentials.html#google.auth.credentials.Credentials), optional) –

Credentials for accessing Google APIs. Use this parameter to override default credentials, such as to use Compute Engine [google.auth.compute_engine.Credentials](https://googleapis.dev/python/google-auth/latest/reference/google.auth.compute_engine.html#google.auth.compute_engine.Credentials) or Service Account [google.oauth2.service_account.Credentials](https://googleapis.dev/python/google-auth/latest/reference/google.oauth2.service_account.html#google.oauth2.service_account.Credentials) directly.
  2. When a dictionary is passed to the credentials argument, the code directly instantiates the Credentials class.

google.auth.credentials.Credentials is the base class for all credentials implemented in https://github.com/googleapis/google-auth-library-python. It is not meant to be instantiated directly.

Instantiating only google.oauth2.credentials.Credentials also doesn't seem correct. The user should have the flexibility to instantiate any credentials class, as long as it derives from google.auth.credentials.Credentials.

Possible implementation

To support Python API:

Simply change the type annotation to google.auth.credentials.Credentials.

To support YAML API:

This raises the more general problem of instantiating non-native types through YAML. Ideally, the user should have the flexibility to instantiate any credentials class, as long as it derives from google.auth.credentials.Credentials.

This could be done by adding native functionality to kedro that instantiates an arbitrary object from its arguments (named or otherwise).

my_pd_gbq_dataset:
  type: pandas.GBQQueryDataset
  credentials:
    object: google.oauth2.service_account.Credentials.from_service_account_info
    type: service_account
    project_id: ...
    private_key_id: ...
    private_key: ...
    client_email: ...
    client_id: ...
    auth_uri: ...
    token_uri: ...
    auth_provider_x509_cert_url: ...
    client_x509_cert_url: ...
    universe_domain: ...

Then in code we can load whatever object is specified and pass the remaining arguments.
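
A minimal sketch of that loading step, assuming a reserved `object` key that holds a dotted import path. The helper names `resolve_dotted_path` and `instantiate_from_config` are hypothetical and not part of kedro; the demo targets a standard-library classmethod so it runs without google-auth installed.

```python
import importlib
from typing import Any, Mapping


def resolve_dotted_path(path: str) -> Any:
    """Import the longest importable module prefix of ``path``, then walk
    the remaining parts as attributes (class, classmethod, function, ...)."""
    parts = path.split(".")
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except ModuleNotFoundError:
            continue  # not a module; retry with a shorter prefix
        for attr in parts[split:]:
            obj = getattr(obj, attr)
        return obj
    raise ImportError(f"Cannot resolve {path!r}")


def instantiate_from_config(config: Mapping[str, Any], key: str = "object") -> Any:
    """Call the callable named under ``key`` with the remaining entries as kwargs."""
    kwargs = dict(config)  # copy so the caller's mapping is left untouched
    factory = resolve_dotted_path(kwargs.pop(key))
    return factory(**kwargs)


# Demo with a standard-library classmethod standing in for a credentials factory.
half = instantiate_from_config({"object": "fractions.Fraction.from_float", "f": 0.5})
```

One caveat for the YAML shown above: `google.oauth2.service_account.Credentials.from_service_account_info` takes the service account info as a single mapping rather than as keyword arguments, so a real implementation would need a convention for that case, such as passing the remaining keys as one dictionary.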

However, most use cases would likely only require the service account JSON when a dictionary is passed, so it makes sense to assume that a dictionary means Service Account credentials.

So, inside the code we can do the following:

@@ -13,7 +13,8 @@ import pandas as pd
 import pandas_gbq as pd_gbq
 from google.cloud import bigquery
 from google.cloud.exceptions import NotFound
-from google.oauth2.credentials import Credentials
+from google.auth.credentials import Credentials
+from google.oauth2.service_account import Credentials as ServiceAccountCredentials
 from kedro.io.core import (
     AbstractDataset,
     DatasetError,
@@ -78,7 +79,7 @@ class GBQTableDataset(ConnectionMixin, AbstractDataset[None, pd.DataFrame]):
         dataset: str,
         table_name: str,
         project: str | None = None,
-        credentials: dict[str, Any] | Credentials | None = None,
+        credentials: dict[str, Any] | str | Credentials | None = None,
         load_args: dict[str, Any] | None = None,
         save_args: dict[str, Any] | None = None,
         metadata: dict[str, Any] | None = None,
@@ -92,10 +93,9 @@ class GBQTableDataset(ConnectionMixin, AbstractDataset[None, pd.DataFrame]):
                 Optional when available from the environment.
                 https://cloud.google.com/resource-manager/docs/creating-managing-projects
             credentials: Credentials for accessing Google APIs.
-                Either ``google.auth.credentials.Credentials`` object or dictionary with
-                parameters required to instantiate ``google.oauth2.credentials.Credentials``.
-                Here you can find all the arguments:
-                https://google-auth.readthedocs.io/en/latest/reference/google.oauth2.credentials.html
+                Either a credential that bases on ``google.auth.credentials.Credentials`` OR
+                a service account json as a dictionary OR
+                a path to a service account key json file.
             load_args: Pandas options for loading BigQuery table into DataFrame.
                 Here you can find all available arguments:
                 https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_gbq.html
@@ -277,7 +277,9 @@ class GBQQueryDataset(AbstractDataset[None, pd.DataFrame]):
         self._project_id = project
 
         if isinstance(credentials, dict):
-            credentials = Credentials(**credentials)
+            credentials = ServiceAccountCredentials.from_service_account_info(credentials)
+        elif isinstance(credentials, str):
+            credentials = ServiceAccountCredentials.from_service_account_file(credentials)
 
         self._credentials = credentials

The same pattern is followed in the gcsfs library, which accepts any google.auth.credentials.Credentials object. If a string is passed, it calls ServiceAccountCredentials.from_service_account_file, and if a dictionary is passed, it calls ServiceAccountCredentials.from_service_account_info.

See here: https://github.com/fsspec/gcsfs/blob/main/gcsfs/credentials.py
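
The dispatch in the diff above could also live in a small standalone helper, so that both dataset constructors share it. This is only a sketch: the name `resolve_credentials` is hypothetical, and google-auth is imported lazily purely so the helper can be imported without it. `from_service_account_info` and `from_service_account_file` are the same google-auth constructors used in the diff.

```python
from typing import Any


def resolve_credentials(credentials: Any) -> Any:
    """Build service account credentials from a dict or a key-file path.

    Anything else (including None or an existing credentials object) is
    returned unchanged, leaving validation to the downstream client.
    """
    if isinstance(credentials, (dict, str)):
        # Imported lazily: only needed when credentials must be constructed.
        from google.oauth2.service_account import (
            Credentials as ServiceAccountCredentials,
        )

        if isinstance(credentials, dict):
            # Dictionary: treated as parsed service account JSON.
            return ServiceAccountCredentials.from_service_account_info(credentials)
        # String: treated as a path to a service account key file.
        return ServiceAccountCredentials.from_service_account_file(credentials)
    return credentials
```

Inside `__init__`, the `isinstance` chain would then reduce to `self._credentials = resolve_credentials(credentials)`.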

Happy to open a PR if the above implementation makes sense! 🙂

@merelcht merelcht added the Community Issue/PR opened by the open-source community label Jan 4, 2025