Commit c971cd3: Merge pull request #73 from PecanProject/gsoc2025
[WIP] Update 2025 GSOC ideas
robkooper authored Feb 11, 2025 (2 parents: 72c5d87 + 9fc6daa)
Showing 1 changed file with 63 additions and 85 deletions: src/pages/gsoc_ideas.mdx

---
title: 'GSoC 2025 - PEcAn Project Ideas'
---

# [GSoC - PEcAn Project Ideas](#background)

Ecosystem science has many components, and so does PEcAn! There are many components where you can contribute; below is a list of potential ideas. Feel free to contact any of the mentors on Slack, or ask questions in our #gsoc-2025 channel.

---

## [Project Ideas](#ideas)

Following is a list of project ideas; use it to contact the appropriate mentors on Slack. Feel free to propose your own ideas as well. In that case, contact @kooper on Slack so he can put you in contact with the best mentors.

---

#### [Global sensitivity analysis / uncertainty partitioning](#sa)

This project would extend PEcAn's existing uncertainty partitioning routines, which are primarily one-at-a-time and focused on model parameters, to also consider ensemble-based uncertainties in other model inputs (meteorology, soils, vegetation, phenology, etc.). The project would employ Sobol' methods; some uncommitted code exists that manually prototyped how this would be done in PEcAn. The goal is to refactor/reimplement this prototype into a reliable, automated system and apply it to key test cases in both natural and managed ecosystems.
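
To make the Sobol' idea concrete, here is a minimal sketch using the CRAN `sensitivity` package. The input names and the toy response function are placeholders; in the real system each row of the design would correspond to one PEcAn ensemble member with perturbed inputs.

```r
# Sobol' sensitivity analysis sketch with the 'sensitivity' package.
library(sensitivity)

n  <- 1000  # base sample size; total model runs scale with the input count
X1 <- data.frame(met = runif(n), soil = runif(n), param = runif(n))
X2 <- data.frame(met = runif(n), soil = runif(n), param = runif(n))

# Build the Sobol' design; model responses are supplied afterwards via tell().
sa <- soboljansen(model = NULL, X1 = X1, X2 = X2, nboot = 100)

# Toy stand-in for running the model once per design row (sa$X).
y <- with(sa$X, 2 * met + soil^2 + 0.1 * met * param)
tell(sa, y)

print(sa$S)  # first-order indices: variance explained by each input alone
print(sa$T)  # total indices: include interactions with the other inputs
```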

**Expected outcomes:**

* Reliable, automated sensitivity analysis and uncertainty partitioning
* Applications to test case(s) in natural and/or managed ecosystems.

**Prerequisites:**

- Required: R (existing workflow and prototype is in R)
- Helpful: familiarity with sensitivity analyses

**Contact person:**

Mike @Dietze

**Duration:**

Flexible to work as either a Medium (175hr) or Large (350 hr)

**Difficulty:**

Medium

---

#### [Parallelization of runs](#hpc)

This project would extend PEcAn's existing run mechanisms to run on an HPC system using Apptainer. For uncertainty analysis, PEcAn performs thousands of runs of the same model with small perturbations, which makes it a perfect use case for an HPC. The goal is not to submit thousands of jobs, but to have a single job spanning multiple nodes that runs all of the ensemble members efficiently. Execution can be orchestrated using RabbitMQ, but other methods are encouraged as well. The end goal is for the PEcAn system to be launched and to run the full workflow on the HPC from start to finish, leveraging as many nodes as are given at submission.
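
As a sketch of the "simple round robin" option, each scheduler task could claim every n-th ensemble member. `SLURM_PROCID` and `SLURM_NTASKS` are standard Slurm environment variables; the container image and launcher script names here are hypothetical.

```r
# Round-robin worker: stride through the ensemble list by task count.
task_id  <- as.integer(Sys.getenv("SLURM_PROCID", "0"))
n_tasks  <- as.integer(Sys.getenv("SLURM_NTASKS", "1"))
ensemble <- sprintf("run-%04d", 1:1000)  # e.g., 1000 perturbed runs

mine <- ensemble[seq(task_id + 1, length(ensemble), by = n_tasks)]
for (run_id in mine) {
  # Execute one ensemble member inside the (hypothetical) model container.
  system2("apptainer", c("exec", "pecan-model.sif", "./run_model.sh", run_id))
}
```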

**Expected outcomes:**

A successful project would complete a subset of the following tasks:

* Show different ways to launch the jobs (RabbitMQ, lock files, simple round robin, etc.)
* A report on the different options and how they can be enabled

**Prerequisites:**

- Required: R (existing workflow and prototype is in R), Docker
- Helpful: familiarity with HPC and Apptainer

**Contact person:**

Rob @Kooper

**Duration:**

Flexible to work as either a Medium (175hr) or Large (350 hr)

**Difficulty:**

Medium

---
#### [Database Improvements](#db)

**Chris TODO**
- Decouple traits from provenance
- Make betydb.org data available through an R package
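
For context on the second item: trait data in betydb.org can currently be queried from R with the rOpenSci `traits` package, roughly as below; a dedicated PEcAn-side package would build on or replace this kind of access. (The query string follows the `traits` documentation; column availability may vary.)

```r
# Query BETYdb (betydb.org) for Vcmax records on Salix via rOpenSci 'traits'.
library(traits)

vcmax <- betydb_search(query = "Salix Vcmax")
head(vcmax)  # trait records with species, value, units, and citation info
```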



**Contact person:**
Chris Black (@infotroph)

**Duration:**
Flexible to work as either a Medium (175hr) or Large (350 hr)

**Difficulty:**
Medium

---

#### [Development of Notebook-based PEcAn Workflows](#notebook)

The PEcAn workflow is currently run using either a web-based user interface, an API, or custom R scripts. The web-based user interface is the easiest to use but has limited functionality, whereas the custom R scripts and API are more flexible but require more experience.

This project will focus on building Quarto workflows aimed at providing an interface to PEcAn that is both welcoming to new users and flexible enough to be a starting point for more advanced users. It will build on the existing [Pull Request 1733](https://github.com/PecanProject/pecan/pull/1733).
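
A minimal sketch of what such a notebook might look like is below. The file layout is hypothetical; `PEcAn.settings::read.settings()` is an existing PEcAn function, while the remaining workflow steps are outlined in comments only.

````
---
title: "PEcAn demo run"
format: html
---

```{r settings}
# Read a PEcAn settings file describing the site, model, and run dates.
settings <- PEcAn.settings::read.settings("pecan.xml")
```

```{r workflow}
# Subsequent chunks would step through the standard PEcAn workflow
# (write model configs, start the runs, load outputs for plotting),
# mirroring the calls in the project's existing workflow.R script.
```
````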

**Expected outcome:**

- Two or more template workflows for running PEcAn. A written vignette and a video tutorial introducing their use.

**Prerequisites:**

- Familiarity with R. Familiarity with RStudio and Quarto or R Markdown is a plus.

**Contact person:**
David LeBauer @dlebauer, Nihar Sanda @koolgax99

**Duration:**
Medium (175hr)

**Difficulty:**
Medium

---


<!--


# This comment section is for ideas that may be potentially viable in the future (with revision)

#### [Optimize PEcAn for freestanding use of single packages [R package development]](#freestanding)

...

**Duration:**
Flexible to work as either a Small (175hr) or Large (350 hr)

**Difficulty:**
Medium, Large

---

#### [PEcAn model coupling and development [Data Science]](#coupling)

PEcAn has the capability to interface multiple ecological models. The goal of this project is to improve the coupling of existing models to PEcAn (specifically FATES) and add new models (specifically a simple vegetation model that is under development). It is also possible to contribute to the development of the simple vegetation model which is written in Fortran.

**Expected outcome:**

...

**Duration:**
Flexible to work as either a Small (175hr) or Large (350 hr)

**Difficulty:**
Medium

---
-->

