Random Forest: Training/Regression, Classifier/Predicting... #295

m-mohr · 2021-10-26T15:05:53Z

We need two (or one?) new processes for Random Forest that support classification and regression.

Would training happen outside of openEO for now?

Implementations:

- Fortran: https://www.stat.berkeley.edu/users/breiman/RandomForests/cc_manual.htm
- R randomForest: https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest
- R ranger: https://www.rdocumentation.org/packages/ranger/versions/0.13.1/topics/ranger
- Python sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html / https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
- Spark MLlib: http://spark.apache.org/docs/latest/mllib-ensembles.html#random-forests
- Check the eo-learn implementation from Sinergise
- and many more...

PS: That's a lot of parameters, wow!

-> Related: save_model / load_model with GLMLC metadata: #300

jdries · 2021-10-27T06:25:20Z

We'll need training as well, as the saved model formats may be specific to the implementation used?
I'm also not sure if the process should be limited to random forest; we're also thinking about using CatBoost, which supports nodata.

m-mohr · 2021-10-27T09:37:39Z

> We'll need training as well, as the saved model formats may be specific to the implementation used?

OK, good. I wasn't sure whether the model would be provided through file upload, but that's actually not yet a thing in Platform.

> I'm also not sure if the process should be limited to random forest; we're also thinking about using CatBoost, which supports nodata.

I guess that depends a lot on what the individual processes for training, classification and regression would look like afterwards. If they have a lot of parameters, they should probably be separate; otherwise you end up in a mess with schemas. If they are just "choose a method and a file" or so, we might be able to merge them into a generic one. Let's see, I still need to do more research as I don't have a lot of experience with all this, unfortunately...

mattia6690 · 2021-12-09T15:56:15Z

Recap of today's meeting on the randomForest process:

- We need to flatten the data from a vector cube to a 2D (table-like) object as RF input.
- Specify the dimension(s) of a vector cube that act as predictors for the model.
- Two separate functions for training and prediction.
- A new process for sampling might be useful in the future (new issue already, @m-mohr?).
- Processes will be based on vector cubes instead of raster cubes to give the user more flexibility (e.g. import of polygons and lines is possible).
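To picture the flattening step from the recap, here is a minimal Python sketch. It assumes a toy vector cube stored as a nested dict (geometry id -> band -> values over time). That layout and the name `flatten_cube` are hypothetical illustrations, not part of any openEO API.

```python
from typing import Dict, List, Tuple

# Hypothetical toy "vector cube": geometry id -> band name -> values over time.
VectorCube = Dict[str, Dict[str, List[float]]]


def flatten_cube(
    cube: VectorCube, bands: List[str]
) -> Tuple[List[str], List[str], List[List[float]]]:
    """Flatten the chosen predictor dimension(s) into one table row per geometry.

    Returns (row_ids, column_names, table); every (band, time step) pair
    becomes one predictor column.
    """
    row_ids = sorted(cube)
    n_steps = len(cube[row_ids[0]][bands[0]])
    columns = [f"{band}_t{t}" for band in bands for t in range(n_steps)]
    table = [
        [cube[gid][band][t] for band in bands for t in range(n_steps)]
        for gid in row_ids
    ]
    return row_ids, columns, table


cube = {
    "field_1": {"B04": [0.10, 0.12], "B08": [0.40, 0.45]},
    "field_2": {"B04": [0.30, 0.28], "B08": [0.20, 0.22]},
}
row_ids, columns, table = flatten_cube(cube, bands=["B04", "B08"])
```

Each geometry ends up as one row and each band/time combination as one column, which is the table-like shape most RF implementations expect.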

For more information, I put the Presentation here. This is a kickstarter for the UC8 implementation

jdries · 2021-12-13T11:18:57Z

Some feedback based on internal discussion at VITO:

- The landcover use case will require prediction on raster cubes; training can happen on vector cubes. (We need to produce a map at the end.)
- For training, we can convert our polygons into a set of points (offline), where we basically sample each polygon with a number of points. That would allow us to use a process like 'aggregate_spatial' for the raster-to-vector conversion, because with points the original pixel values are maintained.
- In our case, the flattening has been taken care of by apply_dimension, but it's fine if another process is defined for that (doing the same thing).
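The offline polygon-to-points conversion mentioned above could look roughly like the following Python sketch. It uses plain rejection sampling inside the bounding box together with a ray-casting point-in-polygon test. The function names and the random strategy are assumptions for illustration, not the script VITO actually uses.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, ring: List[Point]) -> bool:
    """Ray-casting test for a simple polygon given as a list of vertices."""
    x, y = pt
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Count edges that a horizontal ray from pt towards +x crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sample_points(ring: List[Point], label: str, n: int, rng: random.Random):
    """Rejection-sample n random points inside the polygon; each inherits its label."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    samples = []
    while len(samples) < n:
        pt = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if point_in_polygon(pt, ring):
            samples.append((pt, label))
    return samples


rng = random.Random(42)
triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
training_points = sample_points(triangle, label="maize", n=5, rng=rng)
```

The resulting labelled points can then serve as the POINT geometries for the raster-to-vector step.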

m-mohr · 2021-12-13T13:44:31Z

> New process for sampling might be useful in the future (new Issue already @m-mohr?)

Yes, quickly opened one here: #313

edzer · 2021-12-13T19:40:01Z

Thanks, helpful! Here is a high-level sketch of the process(es) as I see them (for pixel-wise ML methods, such as RF). Following ML terminology, I use labels for the response (e.g. crop type; either a class variable or a continuous variable) and features for the predictors (e.g. the bands, or bands x time, based on which an RF predicts a class given a model).

As @mattia6690 notes, there are two separate steps: (A) train a model, (B) predict on new features.

A train model

- input: "locations" (points, pixels) with:
  - labels
  - features
- input: hyper-parameters
- output: "model"
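As a rough illustration of this interface (labelled feature rows plus hyper-parameters in, a "model" out), here is a deliberately tiny Python stand-in. It bags decision stumps, each on one randomly chosen feature, and is not a full random forest. The names `train_model`, `fit_stump` and `predict` and the hyper-parameters `n_trees` and `seed` are hypothetical.

```python
import random
from collections import Counter
from typing import List, Tuple

# (feature index, threshold, label if value <= threshold, label otherwise)
Stump = Tuple[int, float, str, str]


def majority(labels: List[str]) -> str:
    return Counter(labels).most_common(1)[0][0]


def fit_stump(X: List[List[float]], y: List[str], feature: int) -> Stump:
    """Pick the threshold on one feature that misclassifies the fewest samples."""
    values = sorted(set(row[feature] for row in X))
    fallback = majority(y)
    # If no split is possible, the stump always predicts the majority label.
    best = (len(y) + 1, (feature, float("inf"), fallback, fallback))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [lab for row, lab in zip(X, y) if row[feature] <= thr]
        right = [lab for row, lab in zip(X, y) if row[feature] > thr]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
        if errors < best[0]:
            best = (errors, (feature, thr, l_lab, r_lab))
    return best[1]


def train_model(features, labels, n_trees: int = 15, seed: int = 0) -> List[Stump]:
    """Train on labelled feature rows; the returned list of stumps is the 'model'."""
    rng = random.Random(seed)
    n = len(features)
    model = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        X = [features[i] for i in idx]
        y = [labels[i] for i in idx]
        feature = rng.randrange(len(features[0]))  # one random feature per stump
        model.append(fit_stump(X, y, feature))
    return model


def predict(model: List[Stump], row: List[float]) -> str:
    """Majority vote over all stumps."""
    votes = [l_lab if row[f] <= thr else r_lab for f, thr, l_lab, r_lab in model]
    return majority(votes)


features = [[0.10, 0.40], [0.12, 0.45], [0.08, 0.42], [0.11, 0.50],
            [0.30, 0.20], [0.28, 0.22], [0.33, 0.18], [0.31, 0.25]]
labels = ["forest"] * 4 + ["urban"] * 4
model = train_model(features, labels, n_trees=15, seed=0)
```

The point locations themselves are not used for fitting. Only the labels and features at those locations matter, which is why the input can be treated as a plain table.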

See below for how we get to these input data, e.g. from polygons

B Predict (classify, regress)

- input: data: raster data cube with features as a dimension
- input: dimension: feature dimension name
- input: context: "model"
- two options:
  - B1: we only predict a class, or a scalar
    - input: reducer: needs to be defined: takes the model, returns the class
    - output: data cube with labels (class, or continuous variable)
    - this is (a special case of) reduce_dimension
  - B2: we want probabilities for each class (a standard option for any classifier)
    - input: process: needs to be defined: takes the model, returns the class probabilities
    - output: data cube with probabilities per class (class is a dimension, probability the attribute)
    - this is (a special case of) apply_dimension with target_dimension = "class"
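The difference between B1 and B2 is mostly a difference in output shape, which the following Python sketch tries to show. The "model" here is a hypothetical stand-in (one centroid per class, with inverse-distance weights as pseudo-probabilities), used only so the example runs. It is not how a random forest computes probabilities.

```python
from typing import Dict, List

# Hypothetical stand-in for a trained model: one centroid per class in feature space.
MODEL: Dict[str, List[float]] = {"forest": [0.1, 0.4], "urban": [0.3, 0.2]}
CLASSES = sorted(MODEL)


def predict_probabilities(pixel: List[float], model: Dict[str, List[float]]) -> List[float]:
    """B2 callback: one feature vector in, one probability per class out."""
    # Inverse squared distance to each centroid, normalised to sum to 1.
    weights = [
        1.0 / (1e-9 + sum((a - b) ** 2 for a, b in zip(pixel, model[c])))
        for c in CLASSES
    ]
    total = sum(weights)
    return [w / total for w in weights]


def predict_class(pixel: List[float], model: Dict[str, List[float]]) -> str:
    """B1 callback: one feature vector in, a single label out."""
    probs = predict_probabilities(pixel, model)
    return CLASSES[probs.index(max(probs))]


# Toy raster cube with dimensions (y, x, bands).
cube = [
    [[0.11, 0.41], [0.29, 0.22]],
    [[0.12, 0.38], [0.31, 0.19]],
]

# B1: like reduce_dimension, the "bands" dimension disappears -> (y, x).
label_map = [[predict_class(px, MODEL) for px in row] for row in cube]

# B2: like apply_dimension with a target dimension, "bands" is replaced
# by a "class" dimension -> (y, x, class).
prob_cube = [[predict_probabilities(px, MODEL) for px in row] for row in cube]
```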

data for A: train model

Typical steps needed before we can train the model (A3) are:

- case A1: training data consist of polygons and their class values, where each polygon is uniform in its class value. This needs a method to either:
  - A1.1: sample points within the polygons, given some sampling strategy (random? regular?) and sample size
  - A1.2: given a raster data cube, find all the pixel centers within the polygons: may require a new process ("extract"? could be combined with A2):
    - input: polygon geometries
    - input: raster data cube
    - output: POINT geometries of pixel centers inside the polygon, with associated polygon ID
  - output of A1: point locations + labels -> go to case A2
- case A2: extract features at the training point locations: we think this should happen with aggregate_spatial when called with POINT geometries (although no aggregation takes place):
  - input: data: raster data cube with features
  - input: geometries: POINT locations + labels
  - input: reducer: array_element with index 0
  - output: point locations + features at these points -> go to case A3
- case A3: train model:
  - input: point locations + features (output of A2)
  - input: hyper-parameters
  - output: model

Note that steps A1.2 + A2 together (for a set of polygons and a raster (cube), return the raster pixel centers and all the associated pixel values) form a very common operation; in R it is usually called extract.
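A minimal Python sketch of such a combined extract step, under a few assumptions: a regular north-up grid, a raster stored as nested lists, and simple polygons without holes. The function name `extract`, its parameters, and the row layout of the result are hypothetical. This is neither the R function nor an openEO process definition.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def inside(pt: Point, ring: List[Point]) -> bool:
    """Ray-casting point-in-polygon test for a simple polygon."""
    x, y = pt
    hit = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            hit = not hit
    return hit


def extract(raster, origin: Point, pixel_size: float,
            polygons: Dict[str, List[Point]]) -> List[dict]:
    """Return one row per pixel centre that falls inside a polygon.

    raster: raster[row][col] is the list of feature values of that pixel.
    origin: (x, y) of the upper-left corner; rows run downwards in y.
    polygons: polygon id -> list of vertices.
    """
    x0, y0 = origin
    rows = []
    for r, line in enumerate(raster):
        for c, values in enumerate(line):
            centre = (x0 + (c + 0.5) * pixel_size, y0 - (r + 0.5) * pixel_size)
            for poly_id, ring in polygons.items():
                if inside(centre, ring):
                    rows.append({"polygon": poly_id, "point": centre, "features": values})
    return rows


# 3 x 3 raster with two features per pixel, upper-left corner at (0, 3).
raster = [[[r, c] for c in range(3)] for r in range(3)]
polygons = {"p1": [(0.0, 3.0), (2.0, 3.0), (2.0, 1.0), (0.0, 1.0)]}
rows = extract(raster, origin=(0.0, 3.0), pixel_size=1.0, polygons=polygons)
```

Joining the polygon labels onto these rows by polygon ID would give the labels + features table that the training step needs.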

jdries · 2021-12-14T08:58:08Z

Nice overview!
For A1.1 we will first write a script that does this client-side, where we have full flexibility to do it however we like, but I'm not opposed to also defining it as a process. The same goes for A1.2, which seems even simpler.

For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?

m-mohr · 2021-12-14T10:24:44Z

> For prediction (B1/B2), instead of having special cases of apply/reduce dimension, could a prediction process also simply be a callback? Wouldn't that integrate better in the whole processes framework?

Yes, that's actually what we discussed yesterday but Edzer did not mention it explicitly. So to visualize it with a bit of JS-like pseudo-code for B1:

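// JS-like pseudo-code; predict_rf and load_ml_model are placeholder process names
p = new ProcessBuilder()
cube = p.load_collection('S2')
// load a previously trained model (here: the result of a training job)
model = p.load_ml_model('my_model_job')
// the reducer receives the model through the `context` parameter
reducer = function(data, context) {
  return this.predict_rf(data = data, model = context)
}
// reduce the 'bands' dimension to one predicted value per pixel
x = p.reduce_dimension(data = cube, reducer = reducer, dimension = 'bands', context = model)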
...

Not fully fleshed out yet, but to give an idea...
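To make the callback idea concrete outside of any client library, here is a small runnable Python sketch of how a reducer can receive the model through `context`. Everything in it is a toy assumption: the cube is a list of per-pixel dicts, `reduce_dimension` and `load_ml_model` are local stand-ins that only borrow the names used above, and the "model" is a single threshold rule.

```python
from typing import Any, Callable, Dict, List

# Hypothetical toy cube: a list of pixels, each a dict of band label -> value.
Pixel = Dict[str, float]


def reduce_dimension(data: List[Pixel], reducer: Callable[[List[float], Any], Any],
                     dimension_labels: List[str], context: Any = None) -> List[Any]:
    """Collapse the labelled dimension of every pixel by calling reducer(values, context)."""
    return [reducer([pixel[k] for k in dimension_labels], context) for pixel in data]


def load_ml_model(job_id: str) -> Dict[str, Any]:
    """Stand-in for model loading: a single NDVI-like threshold rule."""
    return {"job": job_id, "threshold": 0.3}


def predict(values: List[float], model: Dict[str, Any]) -> str:
    """Reducer callback: the model arrives through the context argument."""
    red, nir = values
    ndvi = (nir - red) / (nir + red)
    return "vegetation" if ndvi > model["threshold"] else "other"


cube = [{"B04": 0.10, "B08": 0.40}, {"B04": 0.30, "B08": 0.32}]
model = load_ml_model("my_model_job")
result = reduce_dimension(cube, reducer=predict,
                          dimension_labels=["B04", "B08"], context=model)
```

The generic reducer machinery never needs to know what a model is. It only forwards `context`, which is why prediction can be an ordinary callback.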

m-mohr added the new process label Oct 26, 2021

m-mohr added this to the 1.3.0 milestone Oct 26, 2021

m-mohr self-assigned this Oct 26, 2021

m-mohr changed the title ~~Random Forest: Classifier and Regression~~ Random Forest: Training, Classifier and Regression Oct 27, 2021

m-mohr changed the title ~~Random Forest: Training, Classifier and Regression~~ Random Forest: Training, Classifier, Regression, Predict... Oct 27, 2021

m-mohr changed the title ~~Random Forest: Training, Classifier, Regression, Predict...~~ Random Forest: Training/Regression, Classifier/Predicting... Nov 12, 2021

m-mohr assigned mattia6690 Nov 12, 2021

m-mohr modified the milestones: 1.3.0, 1.2.0 Nov 12, 2021

m-mohr added a commit that referenced this issue Nov 18, 2021

New processes for random forest #295

761b147

m-mohr linked a pull request Nov 18, 2021 that will close this issue

Processes for Random Forest #306

Merged

m-mohr modified the milestones: 1.2.0, 1.3.0 Nov 29, 2021

This was referenced Dec 13, 2021

Additional use cases #231

Closed

Sampling #313

Closed

m-mohr added the ML label Dec 13, 2021

m-mohr closed this as completed Mar 9, 2022

LukeWeidenwalker mentioned this issue Apr 12, 2022

predict_random_forest should be a reducer Open-EO/openeo-processes-python#154

Closed

m-mohr modified the milestones: 1.3.0, 2.0.0 Feb 1, 2023
