Desired data transformations in analysis workflow #11

Open
4 of 16 tasks
suvayu opened this issue Feb 21, 2024 · 16 comments
Labels
epic Epic issues (collection of smaller tasks towards a goal)

Comments

suvayu (Member) commented Feb 21, 2024

Terminology

  • data source: a tabular dataset with arbitrary schema (column
    names & types) in a local file; could be CSV, Excel, Parquet,
    JSON(LD), or a database
  • transformation: a change applied to the dataset that either
    updates its values, or changes its schema (renaming columns, or
    setting data types)
  • pipeline/workflow: a chain of transformations applied to a
    (set of) dataset(s)

Desired input

It would be good to collect the kinds of data transformations that a
user might want to apply before using the final result as TEM input.

  • A. read different data sources, and merge them, possibly dropping
    some columns
  • B. Function to read different data sources and replace columns #14
    • fill missing values from the more complete source
      e.g. you want to change the assets that are investable, so you
      create a new CSV with only the assets you want to change, and
      the corresponding investable column. This transform will create a
      new table where the investable column has the new values for the
      assets you want to change, and the old values for the assets you
      haven't specified.
  • C. Function to set a column to a value #13
  • D. apply a constant scaling factor to a column
    • apply a scaling factor to a column that varies per row
  • E. transform a column by passing it through an arbitrary
    mathematical function
    e.g. maybe you want to normalise a column
    (see the sketch after this list)
  • G. ability to filter on rows and apply the above transforms
  • H. transformations that you can apply to non-tabular sources to get
    a tabular dataset
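
A rough sketch of D and E through the DuckDB Julia client (only an
illustration: the assets table, its columns, and the values are made
up):

import DataFrames as DF
import DuckDB: DBInterface, DB

con = DBInterface.connect(DB)  # in-memory DB
DBInterface.execute(con, """
    CREATE TABLE assets AS SELECT * FROM (VALUES
        ('wind', 100.0), ('ccgt', 250.0), ('solar', 50.0)
    ) AS t(name, capacity)
""")

# D: apply a constant scaling factor to a column
DBInterface.execute(con, "UPDATE assets SET capacity = capacity * 1.1")

# E: pass a column through a mathematical function, e.g. normalise it to [0, 1]
res = DBInterface.execute(con, """
    SELECT name,
           (capacity - MIN(capacity) OVER ())
             / (MAX(capacity) OVER () - MIN(capacity) OVER ()) AS capacity_norm
    FROM assets
""")
df = DF.DataFrame(res)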

We can create issues from this list, and mark them "Now" or "Soon".

Workflow wishlist

  • Apply function on column of filtered table (EXAMPLE: for type=consumer, change investment_cost to +1000)
  • Merge file or table with Julia dataframe
  • Combine tables with different IDs (thus adding rows)
  • Combine tables where the same columns have different names
  • Read table into dataframe
    (Migrated from #289)
  • Able to visualize and inspect intermediate datasets
  • Able to compare and inspect multiple runs (e.g. different scenarios)

Future outlook

The idea is to provide canned Julia recipes (functions) that wrap the
complexity of some of these steps. Then a future user can simply
write a Julia script that looks somewhat like this:

import TulipaIO: source, @transform1, @transform2, @save, @read, @dump
import TulipaEnergyModel as TEM

source("<file>") |> @transform1 params... |> transform2 params... |> @save "name"

@read "name" |> TEM.create_model |> @dump "results"

This is of course an overly simplified example; think of this as
"Eventually".

Unsolved issues

As a user works on a project and tries out different data processing steps, they will probably create many tables in DuckDB. This can easily turn into a mess. There are two mitigations:

  • A policy to use TEMPORARY tables (tables that are transient, and disappear after a session) in circumstances where we know an intermediate table isn't useful later
  • An accounting system that keeps track of the tables by adding metadata to make them findable. It could be file based (JSON), or a metadata table in DuckDB. (See the sketch after this list.)
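
A minimal sketch of both mitigations through the Julia client (the
table names scratch_assets and tbl_metadata, and the description
text, are made up):

import DuckDB: DBInterface, DB

con = DBInterface.connect(DB)  # in-memory DB

# mitigation 1: a TEMPORARY table disappears when the session ends
DBInterface.execute(con, """
    CREATE TEMPORARY TABLE scratch_assets AS
    SELECT * FROM (VALUES ('wind', 110.0), ('ccgt', 275.0)) AS t(name, capacity)
""")

# mitigation 2: a metadata table in DuckDB recording what each table is for
DBInterface.execute(con, """
    CREATE TABLE IF NOT EXISTS tbl_metadata (
        tbl_name VARCHAR,
        description VARCHAR,
        created_at TIMESTAMP DEFAULT current_timestamp
    )
""")
DBInterface.execute(con, """
    INSERT INTO tbl_metadata (tbl_name, description)
    VALUES ('scratch_assets', 'assets with a test scaling applied')
""")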

Dev setup

The transformations mentioned earlier can be in SQL or Julia, or a mix
of both. So it would be good to have SQL recipes for the simpler ones,
use Julia for the more custom ones, and combine the two to get the
best of both worlds.

To develop the SQL recipes, the simplest setup is to use the DuckDB
CLI, and try to work with CSV files.

  • Install DuckDB CLI
  • Start it up and play ($ duckdb), e.g. something like below:
    SELECT * FROM read_csv_auto('path/to/my.csv', header=true, skip=1);
    CREATE TABLE my_tbl AS SELECT * FROM read_csv_auto('path/to/my.csv', header=true, skip=1);

The other option is to use the Julia client:

  • start up a Julia session and activate TulipaIO as usual
  • from Julia start a DuckDB session following the recipes in the DuckDB Julia client docs
  • do similar experiments as above
    import DataFrames as DF
    import DuckDB: DBInterface, DB
    con = DBInterface.connect(DB) # in-memory DB
    res = DBInterface.execute(con, "SELECT * FROM read_csv_auto('path/to/my.csv', header=true, skip=1)")
    df = DF.DataFrame(res.tbl)
  • any recipes we develop here can then be wrapped as Julia
    functions.
  • note: the "Scanning DataFrames" section shows how to do the 2nd
    & 3rd part of our goals, doing more mathematically complex tasks in
    Julia, and bringing them together in SQL.
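
A sketch of that "Scanning DataFrames" recipe, loosely following the
DuckDB Julia client docs (the assets table, the my_df name, and the
scaling are made up for illustration):

import DataFrames as DF
import DuckDB
import DuckDB: DBInterface, DB

con = DBInterface.connect(DB)  # in-memory DB
DBInterface.execute(con, "CREATE TABLE assets (name VARCHAR, capacity DOUBLE)")
DBInterface.execute(con, "INSERT INTO assets VALUES ('wind', 10.0), ('ccgt', 20.0)")

# do the mathematically complex part in Julia
df = DF.DataFrame(name = ["wind", "ccgt"], factor = [0.5, 1.5])

# make the DataFrame queryable from SQL without copying it
DuckDB.register_data_frame(con, df, "my_df")

# bring the two together in SQL
res = DBInterface.execute(con, """
    SELECT a.name, a.capacity * d.factor AS scaled_capacity
    FROM assets AS a
    JOIN my_df AS d USING (name)
""")
df_out = DF.DataFrame(res)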

CC: @clizbe

suvayu added the epic (Epic issues: collection of smaller tasks towards a goal) and Zone: data & import labels on Feb 21, 2024
clizbe (Member) commented Feb 22, 2024

I really like bullet 3 (doing a partial replacement) as this would reduce the amount of filtering necessary.

Do we always require the user to have a second file that they're merging with the main? Or can we give them the option to "type in a value" that is applied to a bunch of rows? (Aka, set all specified rows to "Active")

As for the Julia wrappers - I think first we should just document the SQL/Julia way of doing things. If people have that cookbook, they can do a lot.
Then we can start looking into simplifying the things they're going to type over and over again. I'm a little worried about starting with the simplification, since it adds a layer of abstraction and keeps the user from understanding what's happening. What do you think?

suvayu (Member Author) commented Feb 22, 2024

Do we always require the user to have a second file that they're merging with the main? Or can we give them the option to "type in a value" that is applied to a bunch of rows? (Aka, set all specified rows to "Active")

A file is not mandatory. For your example we need filter and set: set is something we can do now, but filter is not yet possible. So that's definitely another requirement.
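
To make that concrete, a sketch of filter + set through the Julia
client (the assets table and its columns are made up, and the
wishlist's "+1000" is read as an increment):

import DuckDB: DBInterface, DB

con = DBInterface.connect(DB)  # in-memory DB
DBInterface.execute(con, """
    CREATE TABLE assets AS SELECT * FROM (VALUES
        ('wind',   'producer', 500.0),
        ('demand', 'consumer', 200.0)
    ) AS t(name, type, investment_cost)
""")

# filter the rows (WHERE) and set the column only for those rows
DBInterface.execute(con,
    "UPDATE assets SET investment_cost = investment_cost + 1000 WHERE type = 'consumer'")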

I'm a little worried about starting with the simplification, since it adds a layer of abstraction, not having the user understand what's happening.

Good point, I think maybe now it looks a bit confusing because you don't see all the steps one after the other, and it's also a rather advanced feature. How about I first create a few example sessions based on the test datasets, and then you can create a tutorial out of it? The tutorial can have more conceptual explanations to make it easier to follow what's happening under the hood.

clizbe (Member) commented Feb 26, 2024

I think with respect to saving the data files for later, we can come up with a decision and then build it into the workflow and documentation - explaining why we do it the way we do it. But of course taking whatever way they would do it naturally and automating it / adding meta information would be nicest to the user.

clizbe (Member) commented Feb 26, 2024

@suvayu This is bizarre:
If I run this code in a script, I get an error:

con = DBInterface.connect(DB) # in-memory DB
res = DBInterface.execute(con, "SELECT * FROM read_csv_auto('test/data/Norse/assets-data.csv', header=true, skip=1)")
my_df = DF.DataFrame(res.tbl)

[screenshot of the error]

But if I run it by line in the Julia terminal, it works fine.

clizbe (Member) commented Feb 26, 2024

I think C. should be an optional argument, so the user can choose whether to fill missing values from the original source or leave them blank (or set to 0 maybe).

suvayu (Member Author) commented Feb 28, 2024

I think I've seen this error before. I would like to understand it better. Could you open a bug report and provide a recipe to reproduce? Even if it's not an actual bug, it would be best to understand the behaviour.

I think C. should be an optional argument, so the user can choose whether to fill missing values from the original source or leave them blank (or set to 0 maybe).

If they don't want to fill the missing values, can't they use the replace columns transform (B)?

clizbe (Member) commented Feb 28, 2024

Sure I'll make a bug report.

Yes that's what I mean - to me this should be an option for B.
Something like this maybe?

General behavior: Replace column X with column Y
Options:
- Missing Values = X
- Missing Values = 0
- Missing Values = [empty]
- Missing Values = Error/Warning
(Missing values in Y)

suvayu (Member Author) commented Feb 28, 2024

Okay, that makes sense, I think the current implementation assumes (or maybe implements) an INNER JOIN. I think LEFT JOIN is more useful than INNER JOIN.

I guess we should uncheck B and convert it to an issue as well :-p
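
For illustration, a LEFT JOIN plus COALESCE covers the "fill from the
original" option discussed above; a sketch with made-up tables assets
(the original) and patch (the user's replacement values):

import DataFrames as DF
import DuckDB: DBInterface, DB

con = DBInterface.connect(DB)  # in-memory DB
DBInterface.execute(con, """
    CREATE TABLE assets AS SELECT * FROM (VALUES
        ('wind', true), ('ccgt', false), ('demand', false)
    ) AS t(name, investable)
""")
DBInterface.execute(con,
    "CREATE TABLE patch AS SELECT * FROM (VALUES ('ccgt', true)) AS t(name, investable)")

# LEFT JOIN keeps every row of the original; COALESCE takes the patched
# value when there is one, and falls back to the original otherwise
res = DBInterface.execute(con, """
    SELECT a.name, COALESCE(p.investable, a.investable) AS investable
    FROM assets AS a
    LEFT JOIN patch AS p USING (name)
""")
df = DF.DataFrame(res)

Swapping a.investable inside the COALESCE for a constant (or dropping
the COALESCE) would give the other options from the list above.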

suvayu (Member Author) commented Feb 28, 2024

Actually B should be merged with C. I'll do it

clizbe (Member) commented Apr 22, 2024

[image]

clizbe (Member) commented Jul 30, 2024

@suvayu Another situation I discussed with someone from OPERA the other day:
Two people are working on the same "core" data and will want to merge their changes eventually.
One guy is working on the structure of the data - as in, breaking a region into smaller subregions and updating all the tables, etc. to reflect this increased detail.
The other guy is changing numbers in tables, to do updates and things. Maybe in the same tables the first guy is restructuring, but we don't really know.

This might not be a problem with the way we're separating raw data/model/scenarios, but it might still be an issue. I'm not sure.

suvayu (Member Author) commented Jul 31, 2024

What is the difference between "updating all the tables" and "changing numbers in tables"?

Does the first mean changing the schema, i.e. changing columns, what constitutes a unique row, etc.? Whereas the second would mean updating the source dataset to the newest release, changing the weather year, or the inflation calculation, etc.?

clizbe (Member) commented Jul 31, 2024

Yeah it's kind of hard to relate their structure to ours.

One guy is changing the structure of the data. So maybe making new rows and updating old rows. Maybe duplicating a table and filling it in with new information.

The other guy is updating the source data, maybe in the same tables where the structure is changing.

So the dumb method to merge them is to copy-paste the new structure tables into the updated source database. BUT that would lose any possible updates in those same tables.

And as I understand it, the structural changes sort of break the data/scenario until they're complete, so it isn't really possible to do them simultaneously in the same database. (At least with their way of working.)


EXAMPLE ATTEMPT:

  • Person 1 adds a new column to Table A to assign all cities to 'Regions.' Until this is done, the model can't handle the WIP database correctly.
  • Person 2 is updating the existing data in Table A. They are doing runs with the WIP data as they add more and change things.
  • Eventually they want to merge both copies of Table A.

Like I said, with the way we're going to work, this might not even be a problem.

clizbe (Member) commented Jul 31, 2024

Or maybe Person 1 is just adding rows, but Person 2 doesn't want those rows interfering with what he's working on as they appear over time. Basically a version control issue.

suvayu (Member Author) commented Jul 31, 2024

Adding columns doesn't really interfere with what person 2 is doing, but removing a column would. As for adding rows, depending on the operation it may or may not interfere; basically any kind of aggregation will break. But there's a technique where you add a date column that is your "version" (since it's a date, a common name is "asof"), then you can do your query as SELECT * FROM table WHERE asof = '2024-05-01'. New rows will have a different asof date, so they won't interfere. I guess if your data isn't changing frequently, it could also be just a version number.

Note that this technique assumes no data is ever removed. I also don't know what other knock-on effects it might have; e.g. it's possible normal queries become a bit more complex and you incur a small performance hit.
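
A minimal sketch of the asof idea (the assets table and its columns
are made up):

import DuckDB: DBInterface, DB

con = DBInterface.connect(DB)  # in-memory DB
DBInterface.execute(con, """
    CREATE TABLE assets AS SELECT * FROM (VALUES
        ('wind', 100.0, DATE '2024-05-01'),
        ('wind', 120.0, DATE '2024-07-01')  -- person 1's newer row
    ) AS t(name, capacity, asof)
""")

# person 2 keeps querying the version they started from; newer rows don't interfere
DBInterface.execute(con, "SELECT * FROM assets WHERE asof = DATE '2024-05-01'")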

suvayu (Member Author) commented Jul 31, 2024

Also, another technique would be for person 1 to do their changes as logic (transformations) instead of working directly on the tables. I don't think this will cover everything, though.
