Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas_corr(FULL_TABLE_ID STRING) #133

Open
2 tasks done
ThomasEllyatt opened this issue Jan 22, 2024 · 0 comments
Open
2 tasks done

pandas_corr(FULL_TABLE_ID STRING) #133

ThomasEllyatt opened this issue Jan 22, 2024 · 0 comments
Labels
new-bigfunction Suggest a New BigFunction

Comments

@ThomasEllyatt
Copy link
Contributor

Check the idea has not already been suggested

Edit the title above with self-explanatory function name and argument names

  • The function name and the argument names I entered in the title above seems self explanatory to me.

BigFunction Description as it would appear in the documentation

This stored procedure, pandas_corr, is designed to replicate the functionality of the corr function found in the Pandas Python package, specifically tailored for BigQuery. It calculates correlation coefficients between all pairs of numeric fields in a specified table.

The procedure neatly organizes the output by row number and field name, mirroring the structured and intuitive output format provided by Pandas.

Examples of (arguments, expected output) as they would appear in the documentation

To execute this procedure, you need to specify the full table ID as a string variable. Below is an example demonstrating how to use the procedure with the 311_service_requests public table.

CALL us_dataset.pandas_corr('bigquery-public-data.austin_311.311_service_requests');

When run, the user will receive three job results.

  1. Provides an array of the eligible fields from the provided table which will be used to build the correlation map.
  2. Provide the query this procedure generated (just for reference, good to know how it works under the hood sometimes).
  3. Provides the correlation map, similar to the PANDAS corr function.
image

I've included below the code I've pulled together to create this function.

CREATE OR REPLACE PROCEDURE `spreadsheep-20220603.Dashboard_Datasets.pandas_corr`(FULL_TABLE_ID STRING)
BEGIN

DECLARE COLUMN_NAMES ARRAY<STRING>;
DECLARE QUERY STRING DEFAULT '';
DECLARE OUTER_LOOP, INNER_LOOP INT64 DEFAULT 0;
DECLARE PROJECT_DATASET, TABLEID STRING;

SET PROJECT_DATASET = SPLIT(FULL_TABLE_ID, ".")[0] || "." || SPLIT(FULL_TABLE_ID, ".")[1];
SET TABLEID = SPLIT(FULL_TABLE_ID, ".")[2];

EXECUTE IMMEDIATE
(
"SELECT array_agg(column_name) as column_names FROM " || PROJECT_DATASET || ".INFORMATION_SCHEMA.COLUMNS WHERE table_name = '" || TABLEID || "' AND data_type IN ('INT64','FLOAT64','NUMERIC')"
) INTO COLUMN_NAMES
;

LOOP

    IF 
        OUTER_LOOP > ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN LEAVE;
    ELSE
        SET QUERY = QUERY || "SELECT '" || OUTER_LOOP || "' as row_number, '" || COLUMN_NAMES[OUTER_LOOP] || "' as field, \n";

        LOOP

            IF
                INNER_LOOP > ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN LEAVE;
            ELSE
                SET QUERY = QUERY || "CORR("|| COLUMN_NAMES[OUTER_LOOP] || ", " || COLUMN_NAMES[INNER_LOOP] || ") as " || COLUMN_NAMES[INNER_LOOP] || ", \n";
            END IF;

            SET INNER_LOOP = INNER_LOOP + 1;

        END LOOP;
        SET INNER_LOOP = 0;

        IF
            OUTER_LOOP < ARRAY_LENGTH(COLUMN_NAMES) - 1 THEN SET QUERY = QUERY || "FROM `" || PROJECT_DATASET || "." || TABLEID || "`\n\n UNION ALL \n\n";
        ELSE 
            SET QUERY = QUERY || "FROM `" || PROJECT_DATASET || "." || TABLEID || "`";
        END IF;

    END IF;
    SET OUTER_LOOP = OUTER_LOOP + 1;

END LOOP;

SELECT QUERY;
EXECUTE IMMEDIATE(QUERY || " ORDER BY row_number");

END;
@ThomasEllyatt ThomasEllyatt added the new-bigfunction Suggest a New BigFunction label Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new-bigfunction Suggest a New BigFunction
Projects
None yet
Development

No branches or pull requests

1 participant