Dataset not found in pre-populated database file #207

andrerainho · 2023-07-11T14:18:06Z

Problem statement
I run bigquery-emulator with a yaml schema and a database file; insert some data and exit.
When re-running the emulator again; dataset and tables doesn't appear on queries.
If I open them in sqlite browser dataset1 exits base64 encoded.

Steps to reproduce

Run the emulator

docker run \
            --name=bq-emulator \
             --rm -v $PWD:/work \
             -p 9050:9050 \
             ghcr.io/goccy/bigquery-emulator:0.4.2 \
             --project=test \
             --database=/work/test.db \
             --data-from-yaml="/work/server/testdata/data.yaml"

[bigquery-emulator] REST server listening at 0.0.0.0:9050
[bigquery-emulator] gRPC server listening at 0.0.0.0:9060

Query data with bq and curl

bq --api http://0.0.0.0:9050 query --project_id=test "SELECT * FROM dataset1.table_a WHERE id = 1"

+----+-------+---------------------------------------------+------------+----------+---------------------+
| id | name  |                  structarr                  |  birthday  | skillNum |     created_at      |
+----+-------+---------------------------------------------+------------+----------+---------------------+
|  1 | alice | [{"key":"profile","value":"{\"age\": 10}"}] | 2012-01-01 |        3 | 2022-01-01 12:00:00 |
+----+-------+---------------------------------------------+------------+----------+---------------------+

curl http://0.0.0.0:9050/bigquery/v2/projects/test/datasets/dataset1/tables\?prettyPrint\=false | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   385  100   385    0     0   8677      0 --:--:-- --:--:-- --:--:-- 10694
{
  "tables": [
    {
      "creationTime": "1689084629",
      "id": "test:dataset1.table_a",
      "kind": "bigquery#table",
      "tableReference": {
        "datasetId": "dataset1",
        "projectId": "test",
        "tableId": "table_a"
      },
      "type": "TABLE"
    },
    {
      "creationTime": "1689084629",
      "id": "test:dataset1.table_b",
      "kind": "bigquery#table",
      "tableReference": {
        "datasetId": "dataset1",
        "projectId": "test",
        "tableId": "table_b"
      },
      "type": "TABLE"
    }
  ],
  "totalItems": 2
}

Stop the emulator and run it using only the database file
(here I'm not using the schema in the arguments)

docker stop bq-emulator
docker run \
            --name=bq-emulator \
             --rm -v $PWD:/work \
             -p 9050:9050 \
             ghcr.io/goccy/bigquery-emulator:0.4.2 \
             --project=test \
             --database=/work/test.db

[bigquery-emulator] REST server listening at 0.0.0.0:9050
[bigquery-emulator] gRPC server listening at 0.0.0.0:9060

The data is there;

bq --api http://0.0.0.0:9050 query --project_id=test "SELECT * FROM dataset1.table_a WHERE id = 1"

+----+-------+---------------------------------------------+------------+----------+---------------------+
| id | name  |                  structarr                  |  birthday  | skillNum |     created_at      |
+----+-------+---------------------------------------------+------------+----------+---------------------+
|  1 | alice | [{"key":"profile","value":"{\"age\": 10}"}] | 2012-01-01 |        3 | 2022-01-01 12:00:00 |
+----+-------+---------------------------------------------+------------+----------+---------------------+

However, if I try to insert to data it tells me that dataset1 is not found.

pipenv run python3 insert_dataset1.py

bigquery-emulator/testdata/.venv/lib/python3.11/site-packages/google/cloud/_http/__init__.py", line 494, in api_request
    raise exceptions.from_http_response(response)
google.api_core.exceptions.NotFound: 404 GET http://0.0.0.0:9050/bigquery/v2/projects/test/datasets

/dataset1?prettyPrint=false: dataset dataset1 is not found

If I try to list it with curl the same warning.
The same request worked previously!

curl http://0.0.0.0:9050/bigquery/v2/projects/test/datasets/dataset1/tables\?prettyPrint\=false
{
  "error": {
    "errors": [
      {
        "reason": "notFound",
        "location": "",
        "debugInfo": "",
        "message": "dataset dataset1 is not found"
      }
    ],
    "code": 404,
    "message": "dataset dataset1 is not found"
  }
}

Any suggestions or ideas ?

The text was updated successfully, but these errors were encountered:

andrerainho · 2023-07-11T14:20:06Z

The script I used to insert data is this one.

# insert_dataset.py

from google.cloud import bigquery
from google.api_core.client_options import ClientOptions

your_project = "test"
your_dataset = "dataset1"
your_table = "table_a"
your_endpoint = "http://0.0.0.0:9050"
dataset_id = f"{your_project}.{your_dataset}"

data = [
  {
    "birthday": "2007-02-01",
    "created_at": "2022-01-02 18:00:00",
    "id": "3",
    "name": "forbar",
    "skillNum": "5",
    "structarr": [
      {
        "key": "profile",
        "value": "{\"age\": 15}"
      }
    ]
  }
]

if __name__ == '__main__':
    options = ClientOptions(api_endpoint=your_endpoint)
    client = bigquery.Client(project=your_project, client_options=options)

    dataset_ref = client.get_dataset(your_dataset)
    table_ref = dataset_ref.table(your_table)
    table = client.get_table(table_ref)

    try:
        print('Inserting data ...')
        client.insert_rows(table, data)
        print('Data inserted successfully.')
    except ValueError:
        print("Error: Invalid value")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

faisalhasnain · 2023-08-22T23:42:08Z

I seeing the same behavior

prismec · 2023-10-24T10:23:59Z

Same for me

totem3 · 2023-12-10T14:29:27Z

Thank you for reporting.

I am currently investigating this issue and would like to share my findings so far. Although more research is needed for a resolution, here's what I've understood to this point:

The main concern seems to come from the inability to retrieve datasets using the API. Initially, I examined the API responsible for fetching datasets. I discovered that at this point in the code, the project information is retrieved from ctx, and this project information is used to access the datasets.

When the application is launched with the --data-from-yaml option, the projects hold information about the datasets. However, when launched without this option, it seems that the projects do not retain the datasets' information.

The project information in ctx is provided by withProjectMiddleware. This, in turn, gets the project information from findProjects in repository.go, which simply fetches information from the projects table.

I've compared the projects table under two different conditions: with and without the --data-from-yaml option. The findings are as follows:

When the application is started with the --data-from-yaml option, the output of the command select * from projects is:

~/ghq/github.com/goccy/go-zetasqlite/cmd/zetasqlite-cli/zetasqlite-cli issue_207.db "select * from projects"
zetasqlite> select * from projects
+------+------------+--------+
| id   | datasetIDs | jobIDs |
+------+------------+--------+
| proj | [ds]       | []     |
+------+------------+--------+

without --data-from-yaml

~/ghq/github.com/goccy/go-zetasqlite/cmd/zetasqlite-cli/zetasqlite-cli issue_207.db
zetasqlite> select * from projects
+------+------------+--------+
| id   | datasetIDs | jobIDs |
+------+------------+--------+
| proj | []         | []     |
+------+------------+--------+

Here, the same project "proj" shows an empty array for both datasetIDs and jobIDs.

From these observations, I suspect that at some point during the application's startup process, the metadata of the project, specifically the datasets information, is being overwritten. (maybe somewhere in loader.go.) This leads to the absence of datasets information when the application is started without the --data-from-yaml option. Further investigation is needed to confirm this hypothesis and identify the exact point in the code where this overwrite might be happening.

My investigation steps were as follows:

cat  issue_207.yaml
projects:
- id: proj
  datasets:
  - id: ds
    tables:
    - id: tbl
      columns:
      - name: col1
        type: STRING

# shell 1
./bigquery-emulator --port=9050 --grpc-port=9060 --project proj --database issue_207.db --data-from-yaml issue_207.yaml 

# another shell
bq --api http://0.0.0.0:9050 --project_id proj ls

# shell 1
# kill previous db and launch another one without --data-from-yaml
./bigquery-emulator --port=9050 --grpc-port=9060 --project proj --database issue_207.db 

# another shell
bq --api http://0.0.0.0:9050 --project_id proj ls

AxTrusov · 2024-10-30T11:58:19Z

Hey! Is there any progress on this issue? Without it it's basically impossible to persist data. Or maybe there is some workaround?

jovulic · 2024-11-05T03:13:58Z

I do want to preface that I have not looked into the issue here much at all (and could be missing parts), but...

Hey! Is there any progress on this issue? Without it it's basically impossible to persist data. Or maybe there is some workaround?

If you have the willingness patch and build it yourself (and possibly if your usage of bigquery-emulator is like mine -- just calling it with the project and database flags), it looks like you can work-around the issue by removing this Server.Load call. The Server.SetProject call above should handle creating the project the first time, and nothing else within the load routine looks necessary.

And if it proves interesting/useful, below is some more information about the issue, at least from what I saw as I went poking around.

It looks like the issue is related to the Server.Load call that happens on startup when it uses the project it built using the project flag (and dataset, if specified). The problem is that the created project will not include the datasets and tables that were created external to bigquery-emulator, however, the project will be used as the authority of what exists and will write that information to the metadata-related tables.

In other words, if you run the emulator with something like bigquery-emulator --project="test_project" --database=".data", create a dataset like bq mk --api="http://127.0.0.1:9050" --dataset "test_project:test_dataset", you can see the dataset exists with bq ls --api="http://127.0.0.1:9050" --project_id="test_project". If you then stop and start the emulator with the same arguments, the dataset will then be missing as the related metadata has been clobbered when bigquery-emulator writes the metadata as "project test_project with no datasets".

The specific Server.Load call looks to be here and the part that clobbers looks to be here via the Repository.UpdateProject call as while the project does exists, the project argument to the function will not match what is stored, at least if you are using a database file and have made changes externally to bigquery-emulator.

chrisK824 · 2024-11-05T05:38:43Z

There is an open PR around that for a quite while now. #269

totem3 added bug Something isn't working PriorityHigh labels Dec 10, 2023

This comment was marked as resolved.

Sign in to view

bjmmartin mentioned this issue Feb 8, 2024

fix: prevent database rebuild when server starting #269

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dataset not found in pre-populated database file #207

Dataset not found in pre-populated database file #207

andrerainho commented Jul 11, 2023 •

edited

Loading

andrerainho commented Jul 11, 2023 •

edited

Loading

faisalhasnain commented Aug 22, 2023

prismec commented Oct 24, 2023

totem3 commented Dec 10, 2023

This comment was marked as resolved.

AxTrusov commented Oct 30, 2024

jovulic commented Nov 5, 2024 •

edited

Loading

chrisK824 commented Nov 5, 2024

Dataset not found in pre-populated database file #207

Dataset not found in pre-populated database file #207

Comments

andrerainho commented Jul 11, 2023 • edited Loading

andrerainho commented Jul 11, 2023 • edited Loading

faisalhasnain commented Aug 22, 2023

prismec commented Oct 24, 2023

totem3 commented Dec 10, 2023

This comment was marked as resolved.

AxTrusov commented Oct 30, 2024

jovulic commented Nov 5, 2024 • edited Loading

chrisK824 commented Nov 5, 2024

andrerainho commented Jul 11, 2023 •

edited

Loading

andrerainho commented Jul 11, 2023 •

edited

Loading

jovulic commented Nov 5, 2024 •

edited

Loading