Processes for Random Forest #306
Conversation
Could we create a copy of fit_regr_random_forest.json for classification? Like fit_class_random_forest.json?
@clausmichele Sure, go ahead :-)
Ok, I'll add them later.
The regression and classification processes look very similar. Would it make sense to merge them into a single process?
Yeah, they are quite similar, but they work differently, especially regarding the split criterion. I think we could offer both, but programming them, e.g. in Python, you'd always need two separate functions since they are based on different input parameters. Here is some info about Classification and Regression. If you think we have time and/or need a classifier in the future, it could be worthwhile having two functions for the random forest. I'd give the regression function implementation a higher priority though.
In the sklearn example the only difference seems to be the criterion, right? By the way, what are the "Attributes" in sklearn? Are those some kind of different parameters, or is that what is returned?
The algorithm behind classifier and regression is the same afaik. This split criterion differentiation is very important though. While in a classification you work on discrete classes (e.g. forest / non-forest / impervious / agricultural field), the regression works with numerical values. Therefore the classifier ultimately leads to one class of the input dataset being chosen and assigned to a pixel. In UC8 the input and output will be numerical, and therefore the regression is needed. Nevertheless, we could include a criterion that does a classification, but bear in mind that we would need two separate functions to be called (and I don't know whether this is widely implemented outside of R and Python), and I think also some additional errors to be thrown if the input data does not correspond to what is needed.
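For illustration, a minimal scikit-learn sketch of that split-criterion difference (estimator and criterion names are sklearn's own, not a proposal for the openEO spec):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: discrete classes, impurity-based split criterion
clf = RandomForestClassifier(n_estimators=100, criterion="gini")

# Regression: continuous target, variance-reduction split criterion
# ("squared_error" in scikit-learn >= 1.0, previously called "mse")
regr = RandomForestRegressor(n_estimators=100, criterion="squared_error")
```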
@clausmichele Correct me if I am wrong, but I think the attributes are the final output of the random forest regression/classification. These attributes are usually the results attached to the "model object" returned in Python.
Yes, the attributes are all the information about the model that we have created. Maybe they could be exposed as metadata with the stored model or in the logs. However, I don't think this has high priority now.
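For reference, a small sketch of what those "attributes" look like on a fitted scikit-learn model (attribute names as documented by sklearn):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X, y = np.random.rand(50, 3), np.random.rand(50)
model = RandomForestRegressor(n_estimators=10).fit(X, y)

# Attributes live on the fitted model object rather than being returned:
print(model.feature_importances_)  # importance of each predictor
print(model.n_features_in_)        # number of predictors seen during fit
```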
I'm trying to figure out which data to pass to fit_regr_random_forest:
This could be the result of aggregate_spatial using a GeoJSON with the training areas as geometries in a FeatureCollection, right?
This is also the result of aggregate_spatial, right? Maybe I misunderstood some stuff, please help me :)
Here the information about the target variable (in UC8 the fractional canopy cover) is needed, basically for a point that could represent the center of the pixel resulting from the upscaling from VHR resolution to medium (20 m) resolution. I think a GeoJSON with coordinates and FCC values is enough to later extract the predictors (below).
I am not sure what you mean by reducer in this case. The important part is that the target points are associated with a value from each of the predictor rasters. We also thought that aggregate_spatial would be the best openEO function to do so. In short, this means that for ONE pixel/point X associated with ONE value of the target t, ONE value for EACH of the predictors predictor_1...predictor_x will be extracted. This feature space (t, predictor_1, predictor_2, ..., predictor_x) is the input for the random forest model. I hope this was somewhat clearer? Otherwise let me know and I will try to explain in a bit more detail.
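A hedged sketch with the openEO Python client of how that feature space could be assembled (back-end URL, collection name, bands and the GeoJSON are made up for illustration; the fit step itself is left out since its definition is exactly what this issue discusses):

```python
import openeo

connection = openeo.connect("openeo.example.org")  # hypothetical back-end

# Predictor rasters as a raster cube
predictors = connection.load_collection(
    "SENTINEL2_L2A", bands=["B02", "B03", "B04", "B08"]
)

# Training points with the target value (e.g. fractional canopy cover)
training_geojson = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [11.4, 46.5]},
        "properties": {"target": 0.42},
    }],
}

# One value per predictor per point -> the feature space
# (t, predictor_1, ..., predictor_x)
samples = predictors.aggregate_spatial(geometries=training_geojson, reducer="mean")
```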
For our Spark-based implementation, looking at CatBoost: https://catboost.ai/en/docs/concepts/python-reference_catboostregressor_fit
I think that it corresponds to featureSubsetStrategy. In this link it is described as follows: featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.
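For comparison, a minimal PySpark sketch of how this parameter is set there (parameter names as in the Spark MLlib docs):

```python
from pyspark.ml.regression import RandomForestRegressor

# featureSubsetStrategy is the Spark counterpart of mtry: how many
# features are considered as split candidates at each tree node
# ("auto", "sqrt", "onethird", a fraction such as "0.5", ...)
rf = RandomForestRegressor(
    featuresCol="features",
    labelCol="target",
    numTrees=100,
    featureSubsetStrategy="sqrt",
)
```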
@m-mohr why can't I connect the processes in this way? Ok, I see, we defined "data" as raster-cube, but in my opinion, after yesterday's discussion, it should also be a vector-cube.
Yes, that's probably the case. We haven't updated it since yesterday.
I think this could be solved directly in the process, transforming the data into feature vectors depending on the input data, like:
- Input dimensions: (result, variable, time), where we have row = length of the result dimension
- Input dimensions: (result, variable)
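A small NumPy sketch of that flattening (the array shapes are made-up examples):

```python
import numpy as np

# Input dimensions (result, variable, time): 500 samples,
# 4 predictor variables, 6 timesteps
data = np.random.rand(500, 4, 6)

# One row per entry of the result dimension, one column per
# (variable, time) combination -> shape (500, 24)
X = data.reshape(data.shape[0], -1)

# Input dimensions (result, variable) are already rows x features
data_2d = np.random.rand(500, 4)
```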
Thank you Michele, that looks great already. This is a very interesting first structure of the process graph!
The temporal dimension is difficult to deal with, both in regression and classification. I could think of having the "time" or "time interval" as a predictor in the model. E.g., in case of the forest, the time itself is not that important, but an aggregation to 2 (summer/winter) or 4 (+spring and autumn) seasons is extremely useful for a precise prediction. Therefore we'd need to associate a target value to the season. This season would then be an additional predictor in the model.
Aggregating over the seasons would already be possible, and we could just rename the output bands with the period attached, something like:
Yes indeed |
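A sketch of that seasonal aggregation with the openEO Python client, continuing the predictor cube from the sketch further up (the band naming with the period attached is an assumption about the implementation):

```python
# Aggregate the time dimension to seasons; each season then acts as a
# separate set of predictor bands (e.g. "B04_summer", "B04_winter")
seasonal = predictors.aggregate_temporal_period(period="season", reducer="mean")
```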
For the target vector cube of fit_class_random_forest, should the user be responsible for int-encoding the labels ([0, 1, 2, ...]) or can they provide labels of any type, e.g. ["Wheat", "Barley", ...]? For the latter, the back-end would be responsible for keeping track of the encoding in the model to provide the correct results when the predict_random_forest process is called.
Good question. In my opinion handling integers would be enough; we can't generate a raster containing strings as output anyway. So the mapping can easily be done on the user side.
From a user POV, I think the back-end should try to handle this internally without user intervention. Not sure how feasible this is though.
This is actually a very good point @JeroenVerstraelen. I think that it would be very beneficial to use just a number representation, since the RF regression is based on continuous numerical variables only. It cannot deal with characters/strings. I think that using just numbers we would meet the requirements of both approaches. EDIT: I misread that this comment is just related to the classification. In this case I agree with @m-mohr that the user would benefit from having both options, if it is feasible in the implementation.
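For the classification case, scikit-learn's LabelEncoder is a concrete example of the string-to-int bookkeeping a back-end would have to do (a sketch, not a spec proposal):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(["Wheat", "Barley", "Wheat", "Maize"])
print(y)                                # [2 0 2 1] -- int codes, labels sorted
print(le.inverse_transform([0, 1, 2]))  # ['Barley' 'Maize' 'Wheat']
```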
What's still a bit unclear perhaps: how is the prediction vector cube to be joined with the target vector cube, by spatial join, ordering along a given dimension, or some property? Note, in line with our discussion earlier today, this is a case where you don't really need a vector-cube, but could use a more generic 'datacube' instead. I believe the most important requirement is that the cubes are 2-dimensional: samples x predictors and samples x target?
If the vector-cube contains the
We have seen that it's difficult to define the behavior with vector-cubes without an API definition; maybe we can keep it like this until we define them.
I agree about the input definition: if we don't let the user select particular dimensions or properties of the vector-cube and we keep data and target separated, this could indeed be a more generic datacube.
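A toy pandas sketch of the join question, assuming both cubes can be flattened to tables that share a geometry id (all column names and values are hypothetical):

```python
import pandas as pd

predictors = pd.DataFrame(
    {"geom_id": [1, 2, 3], "B04": [0.1, 0.3, 0.2], "B08": [0.5, 0.4, 0.6]}
)
target = pd.DataFrame({"geom_id": [1, 2, 3], "fcc": [0.42, 0.10, 0.77]})

# The join reduced to an id join: samples x predictors with samples x target
training = predictors.merge(target, on="geom_id")
```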
Quoting a comment from Lukas:
Should we change from percentage to float values between 0 and 1 for the train/test values? |
Is it just scikit that does this, or do we have more examples? If someone is talking about the "community" I'd expect at least two distinct examples (e.g. in R or so). But in principle I think we can go with 0-1, too.
@soxofaan @m-mohr @mattia6690 @jdries |
Ah, didn't get the notification for being tagged here, sorry for the late response!
A counterexample I've come across is TensorFlow, which uses strings to specify this: https://www.tensorflow.org/datasets/splits
I'm less familiar with the ecosystem in R, but afaics
@clausmichele If this is commonly supported in libraries, I'd say yes. (Edit: Added, please review)
@LukeWeidenwalker Thank you. That is far more evidence than I had hoped for. So I'll update this to be 0-1. (Edit: Changed, please review)
What other open points do we have for these processes? It's hard to follow all the discussion here, so I'd like to collect open issues here. As far as I can see there are still some uncertainties regarding the definition of the vector cubes and how that influences certain aspects of the implementation: #306 (comment) I couldn't find any other open points, but I could have missed some.
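For reference, the 0-1 convention as scikit-learn exposes it (sklearn's own API, shown here only as the cited precedent):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 4), np.random.rand(100)

# test_size/train_size are fractions in (0, 1), not percentages
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
```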
On the parameter definition "name": "mtry":
GeoPySpark -> max_bins
scikit-learn -> max_features
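A short illustration of the scikit-learn side of that mapping (max_features is sklearn's counterpart of mtry, the number of predictors sampled as split candidates at each node):

```python
from sklearn.ensemble import RandomForestRegressor

# mtry (R randomForest) ~ max_features (scikit-learn)
rf = RandomForestRegressor(max_features="sqrt")
```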
I don't know if this is the right place to post, @m-mohr. Anyway, after a discussion with @ValentinaHutter @LukeWeidenwalker and @mattia6690 we concluded that:
We will proceed to implement a draft version of the process, supporting this via an additional property in the parameters.
@clausmichele Please open a new issue or PR :-)
The new RF processes, in a draft state. Descriptions still need to be improved a bit and we may want to consider merging them with #304