Skip to content

Unable to access newly created columns in BigQuery table schema #659

@Josh-Myers

Description

@Josh-Myers

Bigrquery seems to assume that the table schema won't change, which can cause issues. The reprex demonstrates that if you create a table in BigQuery and then add a column, you can't access the new column using dplyr syntax. I'm told that this is due to bigrquery getting the schema from a cache.

We have a couple of work arounds, but neither are ideal:

  1. If you disconnect the connection, wait up to a minute or so and then reconnect, it can see the new column.
  2. Redefining bq_table_meta (https://github.com/r-dbi/bigrquery/blob/main/R/bq-table.R) as below works but has been inconsistent when I have tested it. Sometimes just waiting the minute or so works, or sometimes I have had to run cars_new_col_qu2 = tbl(con, cars_table) in the reprex twice to get it to work.
bq_table_meta <- function (x, fields = NULL) {
    x <- as_bq_table(x)
    url <- bq_path(x$project, x$dataset, x$table)
    bq_get(url, useQueryCache = FALSE, query = list(fields = fields))
}

Reprex (uses the standard bigrquery bq_table_meta, not the custom one above):

library(DBI)
library(bigrquery)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql
library(glue)

# Replace xxxx with your values
project = "xxxx"
dataset = "xxxx"
test_id = round(as.numeric(Sys.time()))
cars_table = glue("cars_test_{test_id}")

con = dbConnect(
  bigquery(),
  project = project,
  dataset = dataset,
  billing = project
  )

# Create table in database
cars_df = as.data.frame(mtcars)
dbWriteTable(con, cars_table, cars_df)

# Verify it is there
cars_qu = tbl(con, cars_table)
colnames(cars_qu)
#>  [1] "am"   "carb" "vs"   "qsec" "wt"   "drat" "disp" "hp"   "cyl"  "gear"
#> [11] "mpg"

# Make a new column and write to the table
cars_new_col_qu = cars_qu |> 
  mutate(new_col = cyl / mpg)

cars_new_col_qu_string = sql_render(cars_new_col_qu)
dbExecute(con, glue("create or replace table {cars_table} as {cars_new_col_qu_string}"))
#> [1] 0

# Query the table with the new column (this is the same query as cars_qu)
cars_new_col_qu2 = tbl(con, cars_table) 

# Doesn't show new_col
colnames(cars_new_col_qu2)
#>  [1] "am"   "carb" "vs"   "qsec" "wt"   "drat" "disp" "hp"   "cyl"  "gear"
#> [11] "mpg"

# Gives an error because new_col isn't in the schema
cars_new_col_qu3 = cars_new_col_qu2 |> 
  select(mpg, cyl, new_col) 
#> Error in `select()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `new_col` doesn't exist.

# But it does exist which we can see by querying the table directly
cars_table_schema = dbColumnInfo(dbSendQuery(con, glue("select * from `{project}.{dataset}.{cars_table}` limit 0")))
print(cars_table_schema)
#>       name    type
#> 1       am INTEGER
#> 2     carb INTEGER
#> 3       vs INTEGER
#> 4     qsec   FLOAT
#> 5       wt   FLOAT
#> 6     drat   FLOAT
#> 7     disp   FLOAT
#> 8       hp INTEGER
#> 9      cyl INTEGER
#> 10    gear INTEGER
#> 11     mpg   FLOAT
#> 12 new_col   FLOAT

# Clean up
dbExecute(con, glue("drop table {cars_table}"))
#> [1] 0

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions