Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset not found in pre-populated database file #207

Open
andrerainho opened this issue Jul 11, 2023 · 8 comments
Open

Dataset not found in pre-populated database file #207

andrerainho opened this issue Jul 11, 2023 · 8 comments
Labels
bug Something isn't working PriorityHigh

Comments

@andrerainho
Copy link

andrerainho commented Jul 11, 2023

Problem statement
I run bigquery-emulator with a yaml schema and a database file; insert some data and exit.
When re-running the emulator again; dataset and tables doesn't appear on queries.
If I open them in sqlite browser dataset1 exits base64 encoded.

Steps to reproduce

  1. Run the emulator
docker run \
            --name=bq-emulator \
             --rm -v $PWD:/work \
             -p 9050:9050 \
             ghcr.io/goccy/bigquery-emulator:0.4.2 \
             --project=test \
             --database=/work/test.db \
             --data-from-yaml="/work/server/testdata/data.yaml"

[bigquery-emulator] REST server listening at 0.0.0.0:9050
[bigquery-emulator] gRPC server listening at 0.0.0.0:9060
  1. Query data with bq and curl
bq --api http://0.0.0.0:9050 query --project_id=test "SELECT * FROM dataset1.table_a WHERE id = 1"

+----+-------+---------------------------------------------+------------+----------+---------------------+
| id | name  |                  structarr                  |  birthday  | skillNum |     created_at      |
+----+-------+---------------------------------------------+------------+----------+---------------------+
|  1 | alice | [{"key":"profile","value":"{\"age\": 10}"}] | 2012-01-01 |        3 | 2022-01-01 12:00:00 |
+----+-------+---------------------------------------------+------------+----------+---------------------+
curl http://0.0.0.0:9050/bigquery/v2/projects/test/datasets/dataset1/tables\?prettyPrint\=false | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   385  100   385    0     0   8677      0 --:--:-- --:--:-- --:--:-- 10694
{
  "tables": [
    {
      "creationTime": "1689084629",
      "id": "test:dataset1.table_a",
      "kind": "bigquery#table",
      "tableReference": {
        "datasetId": "dataset1",
        "projectId": "test",
        "tableId": "table_a"
      },
      "type": "TABLE"
    },
    {
      "creationTime": "1689084629",
      "id": "test:dataset1.table_b",
      "kind": "bigquery#table",
      "tableReference": {
        "datasetId": "dataset1",
        "projectId": "test",
        "tableId": "table_b"
      },
      "type": "TABLE"
    }
  ],
  "totalItems": 2
}
  1. Stop the emulator and run it using only the database file
    (here I'm not using the schema in the arguments)
docker stop bq-emulator
docker run \
            --name=bq-emulator \
             --rm -v $PWD:/work \
             -p 9050:9050 \
             ghcr.io/goccy/bigquery-emulator:0.4.2 \
             --project=test \
             --database=/work/test.db

[bigquery-emulator] REST server listening at 0.0.0.0:9050
[bigquery-emulator] gRPC server listening at 0.0.0.0:9060

The data is there;

bq --api http://0.0.0.0:9050 query --project_id=test "SELECT * FROM dataset1.table_a WHERE id = 1"

+----+-------+---------------------------------------------+------------+----------+---------------------+
| id | name  |                  structarr                  |  birthday  | skillNum |     created_at      |
+----+-------+---------------------------------------------+------------+----------+---------------------+
|  1 | alice | [{"key":"profile","value":"{\"age\": 10}"}] | 2012-01-01 |        3 | 2022-01-01 12:00:00 |
+----+-------+---------------------------------------------+------------+----------+---------------------+

However, if I try to insert to data it tells me that dataset1 is not found.

pipenv run python3 insert_dataset1.py

bigquery-emulator/testdata/.venv/lib/python3.11/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET http://0.0.0.0:9050/bigquery/v2/projects/test/datasets

/dataset1?prettyPrint=false: dataset dataset1 is not found

If I try to list it with curl the same warning.
The same request worked previously!

curl http://0.0.0.0:9050/bigquery/v2/projects/test/datasets/dataset1/tables\?prettyPrint\=false
{
  "error": {
    "errors": [
      {
        "reason": "notFound",
        "location": "",
        "debugInfo": "",
        "message": "dataset dataset1 is not found"
      }
    ],
    "code": 404,
    "message": "dataset dataset1 is not found"
  }
}

Any suggestions or ideas ?

@andrerainho
Copy link
Author

andrerainho commented Jul 11, 2023

The script I used to insert data is this one.

# insert_dataset.py

from google.cloud import bigquery
from google.api_core.client_options import ClientOptions

your_project = "test"
your_dataset = "dataset1"
your_table = "table_a"
your_endpoint = "http://0.0.0.0:9050"
dataset_id = f"{your_project}.{your_dataset}"

data = [
  {
    "birthday": "2007-02-01",
    "created_at": "2022-01-02 18:00:00",
    "id": "3",
    "name": "forbar",
    "skillNum": "5",
    "structarr": [
      {
        "key": "profile",
        "value": "{\"age\": 15}"
      }
    ]
  }
]

if __name__ == '__main__':
    options = ClientOptions(api_endpoint=your_endpoint)
    client = bigquery.Client(project=your_project, client_options=options)

    dataset_ref = client.get_dataset(your_dataset)
    table_ref = dataset_ref.table(your_table)
    table = client.get_table(table_ref)

    try:
        print('Inserting data ...')
        client.insert_rows(table, data)
        print('Data inserted successfully.')
    except ValueError:
        print("Error: Invalid value")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

@faisalhasnain
Copy link

I seeing the same behavior

@prismec
Copy link

prismec commented Oct 24, 2023

Same for me

@totem3 totem3 added bug Something isn't working PriorityHigh labels Dec 10, 2023
@totem3
Copy link
Collaborator

totem3 commented Dec 10, 2023

Thank you for reporting.

I am currently investigating this issue and would like to share my findings so far. Although more research is needed for a resolution, here's what I've understood to this point:

The main concern seems to come from the inability to retrieve datasets using the API. Initially, I examined the API responsible for fetching datasets. I discovered that at this point in the code, the project information is retrieved from ctx, and this project information is used to access the datasets.

When the application is launched with the --data-from-yaml option, the projects hold information about the datasets. However, when launched without this option, it seems that the projects do not retain the datasets' information.

The project information in ctx is provided by withProjectMiddleware. This, in turn, gets the project information from findProjects in repository.go, which simply fetches information from the projects table.

I've compared the projects table under two different conditions: with and without the --data-from-yaml option. The findings are as follows:

When the application is started with the --data-from-yaml option, the output of the command select * from projects is:

~/ghq/github.com/goccy/go-zetasqlite/cmd/zetasqlite-cli/zetasqlite-cli issue_207.db "select * from projects"
zetasqlite> select * from projects
+------+------------+--------+
| id   | datasetIDs | jobIDs |
+------+------------+--------+
| proj | [ds]       | []     |
+------+------------+--------+

without --data-from-yaml

~/ghq/github.com/goccy/go-zetasqlite/cmd/zetasqlite-cli/zetasqlite-cli issue_207.db
zetasqlite> select * from projects
+------+------------+--------+
| id   | datasetIDs | jobIDs |
+------+------------+--------+
| proj | []         | []     |
+------+------------+--------+

Here, the same project "proj" shows an empty array for both datasetIDs and jobIDs.

From these observations, I suspect that at some point during the application's startup process, the metadata of the project, specifically the datasets information, is being overwritten. (maybe somewhere in loader.go.) This leads to the absence of datasets information when the application is started without the --data-from-yaml option. Further investigation is needed to confirm this hypothesis and identify the exact point in the code where this overwrite might be happening.

My investigation steps were as follows:

cat  issue_207.yaml
projects:
- id: proj
  datasets:
  - id: ds
    tables:
    - id: tbl
      columns:
      - name: col1
        type: STRING

# shell 1
./bigquery-emulator --port=9050 --grpc-port=9060 --project proj --database issue_207.db --data-from-yaml issue_207.yaml 

# another shell
bq --api http://0.0.0.0:9050 --project_id proj ls

# shell 1
# kill previous db and launch another one without --data-from-yaml
./bigquery-emulator --port=9050 --grpc-port=9060 --project proj --database issue_207.db 

# another shell
bq --api http://0.0.0.0:9050 --project_id proj ls

@totem3

This comment was marked as resolved.

@AxTrusov
Copy link

Hey! Is there any progress on this issue? Without it it's basically impossible to persist data. Or maybe there is some workaround?

@jovulic
Copy link

jovulic commented Nov 5, 2024

I do want to preface that I have not looked into the issue here much at all (and could be missing parts), but...

Hey! Is there any progress on this issue? Without it it's basically impossible to persist data. Or maybe there is some workaround?

If you have the willingness patch and build it yourself (and possibly if your usage of bigquery-emulator is like mine -- just calling it with the project and database flags), it looks like you can work-around the issue by removing this Server.Load call. The Server.SetProject call above should handle creating the project the first time, and nothing else within the load routine looks necessary.

And if it proves interesting/useful, below is some more information about the issue, at least from what I saw as I went poking around.

It looks like the issue is related to the Server.Load call that happens on startup when it uses the project it built using the project flag (and dataset, if specified). The problem is that the created project will not include the datasets and tables that were created external to bigquery-emulator, however, the project will be used as the authority of what exists and will write that information to the metadata-related tables.

In other words, if you run the emulator with something like bigquery-emulator --project="test_project" --database=".data", create a dataset like bq mk --api="http://127.0.0.1:9050" --dataset "test_project:test_dataset", you can see the dataset exists with bq ls --api="http://127.0.0.1:9050" --project_id="test_project". If you then stop and start the emulator with the same arguments, the dataset will then be missing as the related metadata has been clobbered when bigquery-emulator writes the metadata as "project test_project with no datasets".

The specific Server.Load call looks to be here and the part that clobbers looks to be here via the Repository.UpdateProject call as while the project does exists, the project argument to the function will not match what is stored, at least if you are using a database file and have made changes externally to bigquery-emulator.

@chrisK824
Copy link

There is an open PR around that for a quite while now. #269

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working PriorityHigh
Projects
None yet
Development

No branches or pull requests

7 participants