Data Transform specs #910
Replies: 7 comments
-
Would this not be better tied to a views spec?
-
@pwalsh not sure what you mean exactly. Transforms are independent of views, though commonly used with them ...
-
@rufuspollock not sure we need this open as an issue. Let's get a solid views spec before we discuss transforms between view specs as states.
-
@pwalsh rather than close this, I've moved it to the icebox, as this is a genuine issue that I think we will need for the views spec very soon.
-
UPDATE: just updated the description above with an analysis of some of the different transform systems and languages out there, including Vega and DP Pipelines. We also have working transform support using Vega dataflow in https://github.com/frictionlessdata/datapackage-render-js
-
Hi, I am new to this discussion... Is this framework for Data Transform only for data visualization?
-
hi @ppKrauss No, this is targeted at specifying data transformations for views on data, such as visualisations. For a framework around traceability of data sources (data provenance), please see our Pipelines framework.
-
Appendix: Data Transform Research
Plotly Transforms
No libraries for data transforms have been found.
Vega Transforms
https://github.com/vega/vega/wiki/Data-Transforms - v2
https://vega.github.io/vega/docs/transforms/ - v3
Vega's Data Transforms can be used to manipulate datasets before rendering a visualisation. E.g., one may need to perform transformations such as aggregation or filtering of a dataset (there are many types, see the links above) and display the graph only after that. Another situation would be creating a new dataset by applying various calculations to an old one.
Usually transforms are defined in a `transform` array inside a `data` property. "Transforms that do not filter or generate new data objects can be used within the transform array of a mark definition to specify post-encoding transforms."
Examples:
Filtering
https://vega.github.io/vega-editor/?mode=vega&spec=parallel_coords
This example filters rows that have both `Horsepower` and `Miles_per_Gallon` fields.

Geopath, aggregate, lookup, filter, sort, voronoi and linkpath
https://vega.github.io/vega-editor/?mode=vega&spec=airports
This example has a lot of transforms - in some cases there is only one transform applied to a dataset, in other cases there is a sequence of transforms.
In the first dataset, it applies the `geopath` transform, which maps GeoJSON features to SVG path strings. It uses the `albersUsa` projection type (more about projection).
In the second dataset, it applies a sum operation on the "count" field and outputs it as the "flights" field (see the sketch below).
In the third dataset:
- a `geo` transform as in the first dataset above;
- a `voronoi` transform to compute a Voronoi diagram based on the "layout_x" and "layout_y" fields.
In the last dataset:
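The sum aggregation in the second dataset could be expressed roughly as follows (Vega v3-style syntax; only the "count" and "flights" names come from the description above, while the dataset name, URL and groupby field are assumptions, so treat this as an approximation rather than the actual airports spec):

```json
{
  "name": "traffic",
  "url": "data/flights-airport.csv",
  "transform": [
    {
      "type": "aggregate",
      "groupby": ["origin"],
      "fields": ["count"],
      "ops": ["sum"],
      "as": ["flights"]
    }
  ]
}
```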
Further research on Vega transforms
https://github.com/vega/vega-dataflow-examples/
It is quite difficult for me to read the code as there is not enough documentation. I have included here the simplest example:
`vega-dataflow.js` contains Dataflow, all transforms and Vega's utilities. `df` is a Dataflow instance where we register (`.add`) functions and parameters - as below on lines 36-38. The same goes for adding transforms - lines 40-44. We can pass different parameters to the transforms depending on the requirements of each of them. Event handlers can be added by using the `.on` method of the Dataflow instance - lines 46-48.
DP Pipelines transforms
DPP provides a number of transforms that can be applied to a dataset. However, those transforms cannot be run inside a browser, as the library requires Python.
Below is a copy-paste from DPP docs:
concatenate
Concatenates a number of streamed resources and converts them to a single resource.
Parameters:
- `sources` - Which resources to concatenate. Same semantics as `resources` in `stream_remote_resources`. If omitted, all resources in the datapackage are concatenated. Resources to concatenate must appear in consecutive order within the data-package.
- `target` - Target resource to hold the concatenated data. Should define at least the following properties:
  - `name` - name of the resource
  - `path` - path in the data-package for this file.
  If omitted, the target resource will receive the name `concat` and will be saved at `data/concat.csv` in the datapackage.
- `fields` - Mapping of fields between the sources and the target, so that the keys are the target field names, and values are lists of source field names. This mapping is used to create the target resource's schema. Note that the target field name is always assumed to be mapped to itself.
Example:
In this example we concatenate all resources that look like `report-year-<year>`, and output them to the `multi-year-report` resource.
The output contains two fields:
- `activity`, which is called `activity` in all sources
- `amount`, which has varying names in different resources (e.g. `Amount`, `2009_amount`, `amount`, etc.)
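The YAML for this example did not copy over from the DPP docs; a sketch of what such a step could look like in a pipeline-spec.yaml, based only on the parameters described above (treat the exact syntax as approximate):

```yaml
run: concatenate
parameters:
  sources: 'report-year-*'            # match all report-year-<year> resources
  target:
    name: multi-year-report
    path: data/multi-year-report.csv
  fields:
    activity: []                      # called activity in every source
    amount: ['Amount', '2009_amount'] # varying source names map to one target field
```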
join
Joins two streamed resources.
"Joining" in our case means taking the target resource, and adding fields to each of its rows by looking up data in the source resource.
A special case for the join operation is when there is no target stream, and all unique rows from the source are used to create it.
This mode is called deduplication mode - The target resource will be created and deduplicated rows from the source will be added to it.
Parameters:
- `source` - information regarding the source resource:
  - `name` - name of the resource
  - `key` - either a list of field names to be used as the lookup key, or a string interpreted as a format string for building the key (e.g. `{<field_name_1>}:{field_name_2}`)
  - `delete` - delete from the data-package after joining (`False` by default)
- `target` - Target resource to hold the joined data. Should define at least the following properties:
  - `name` - as in `source`
  - `key` - as in `source`, or `null` for creating the target resource and performing deduplication
- `fields` - mapping of fields from the source resource to the target resource. Keys should be field names in the target resource. Values can define two attributes:
  - `name` - field name in the source (by default the same as the target field name)
  - `aggregate` - aggregation strategy (how to handle multiple source rows with the same key). Can take the following options:
    - `sum` - summarise the aggregated values. For numeric values it's the arithmetic sum, for strings the concatenation of strings, and for other types it will error.
    - `avg` - calculate the average of the aggregated values. For numeric values it's the arithmetic average; for other types it will error.
    - `max` - calculate the maximum of the aggregated values. For numeric values it's the arithmetic maximum, for strings the dictionary maximum, and for other types it will error.
    - `min` - calculate the minimum of the aggregated values. For numeric values it's the arithmetic minimum, for strings the dictionary minimum, and for other types it will error.
    - `first` - take the first value encountered
    - `last` - take the last value encountered
    - `count` - count the number of occurrences of a specific key. For this method, specifying `name` is not required; in case it is specified, `count` will count the number of non-null values for that source field.
    - `set` - collect all distinct values of the aggregated field, unordered
    - `array` - collect all values of the aggregated field, in order of appearance
    - `any` - pick any value. By default, `aggregate` takes the `any` value.

  If neither `name` nor `aggregate` needs to be specified, the mapping can map to the empty object `{}` or to `null`.
- `full` - Boolean. If `True` (the default), failed lookups in the source will result in "null" values at the source; if `False`, failed lookups in the source will result in dropping the row from the target.

Important: the "source" resource must appear before the "target" resource in the data-package.
Examples:
This example aims to create a package containing the GDP and Population of each country in the world.
We have one resource (`world_population`) with data that looks like:
And another resource (`country_gdp_2015`) with data that looks like:
The `join` command will match rows in both datasets based on the `country_code` / `CC` fields, and then copy the value in the `census_2015` field into a new `population` field.
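The original YAML snippet and data tables from the DPP docs did not copy over; a rough sketch of the join step described here, assembled from the field names in the prose (the exact syntax may differ from the real DPP example):

```yaml
run: join
parameters:
  source:
    name: world_population
    key: ['country_code']
    delete: yes          # remove world_population from the package after joining
  target:
    name: country_gdp_2015
    key: ['CC']
  fields:
    population:
      name: census_2015  # copy the source's census_2015 into a new population field
```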
The resulting data package will have the `world_population` resource removed and the `country_gdp_2015` resource looking like:
A more complex example:
This example aims to analyse salaries for screen actors in the MGM studios.
Once more, we have one resource (`screen_actor_salaries`) with data that looks like:
And another resource (`mgm_movies`) with data that looks like:
The `join` command will match rows in both datasets based on the movie name and production year. Notice how we overcome incompatible fields by using different key patterns.
The resulting dataset could look like:
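As above, the original YAML is not reproduced here; the "different key patterns" could look roughly like the following sketch, in which all field names are hypothetical since the source tables are not shown:

```yaml
run: join
parameters:
  source:
    name: screen_actor_salaries
    key: '{movie} ({year})'            # hypothetical fields, combined via a format string
  target:
    name: mgm_movies
    key: '{title} ({production_year})' # same composite key built from differently named fields
  fields:
    total_salaries:
      name: salary
      aggregate: sum                   # e.g. sum all actor salaries per movie
```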