Commit e095d57

Merge pull request #141 from PecanProject/mdietze-patch-1: Update gsoc_ideas.mdx
2 parents cf14128 + 31360de

1 file changed: src/pages/gsoc_ideas.mdx (100 additions & 81 deletions)
---
title: 'GSoC 2026 - PEcAn Project Ideas'
---

# GSoC - PEcAn Project Ideas{#background}

PEcAn is an open-source ecosystem modeling framework integrating data, models, and uncertainty quantification. Below is a list of potential ideas where contributors can help improve and expand PEcAn. To get started contributing to PEcAn, check out [this guide](https://github.com/PecanProject/pecan/discussions/3469). Come find us on Slack to discuss. If you have questions or would like to propose your own idea, contact @kooper or join our **[#gsoc](https://pecanproject.slack.com/archives/C0853U6GF71)** channel in Slack!

---

## Project Ideas{#ideas}

Below is a list of project ideas. Feel free to contact the listed mentors on Slack to discuss further, or contact @kooper with new ideas and he can help connect you with mentors.

1. [Refactor and Parallelize Input Processing Pipelines](#input)
2. [Benchmarking and Validation Framework](#validation)
3. [Increase PEcAn modularity](#module)
4. [Standardizing Model Couplers Across Models](#couplertools)

---

### 1. Refactor and Parallelize Input Processing Pipelines{#input}

Input-processing code in PEcAn (e.g., meteorological preparation) is currently centered around monolithic orchestration functions such as `do.conversions` and `met.process`. These functions mix low-level data transformations with sequential control flow, implicit dependencies, and caching behavior, making them difficult to test, debug, scale, or parallelize across sites and ensemble members.

This project will deprecate `do.conversions` as currently implemented and replace it with input preprocessing workflows that are explicitly structured around data dependencies and are naturally parallelizable across data streams, sites, and ensemble members. The work will refactor or deprecate `met.process` to remove monolithic orchestration and reduce or eliminate opaque caching, while retaining and strengthening existing low-level transformation functions.

As part of the refactor, orchestration logic should be rebuilt to make inputs, outputs, and dependencies explicit. A workflow tool such as `targets` may be used to help define and validate the dependency graph and caching behavior, but must not become a required or exclusive execution path for PEcAn.

This refactoring should also reduce or eliminate implicit dependencies on the global settings object (see Project 3), enabling clearer APIs and improved testability.
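As a configuration sketch (not runnable as-is), the dependency-explicit structure described above might look like the following `_targets.R` fragment if the `targets` package were used; `download_met()`, `standardize_met()`, and `convert_to_model()` are hypothetical stand-ins, not existing PEcAn functions:

```r
# Hypothetical _targets.R sketch: each step declares its inputs and outputs,
# so the dependency graph is explicit and branches can run in parallel.
library(targets)

list(
  tar_target(site_list, read.csv("sites.csv")),                        # explicit input
  tar_target(raw_met,   download_met(site_list),   pattern = map(site_list)),
  tar_target(std_met,   standardize_met(raw_met),  pattern = map(raw_met)),
  tar_target(model_met, convert_to_model(std_met), pattern = map(std_met))
)
```

The point of the sketch is the shape, not the tool: each target's inputs and outputs are visible, and `pattern = map(...)` makes per-site parallelism explicit rather than buried in orchestration code.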

**Expected outcomes:**

A successful project would complete the following tasks:

* A deprecation plan for `do.conversions`, with a replacement that provides a modular suite of preprocessing tools that
  * explicitly defines inputs and outputs, and
  * supports parallel execution across products, sites, and ensemble members.

  Key here is a high-level plan for development that will continue beyond what is accomplished this summer.

* A refactor and/or deprecation plan for `met.process` that:
  * removes monolithic orchestration and hidden control flow,
  * reduces or eliminates over-engineered caching, and
  * retains and documents low-level transformation functions.

* Demonstration of parallel execution on a multi-site or multi-ensemble example.

* Basic correctness and performance benchmarks, including unit and integration tests and validation of PEcAn-standard inputs (formats and units).

* Updated developer documentation covering:
  * the new input-processing architecture,
  * how to add a new preprocessing step, and
  * migration guidance from legacy entry points.
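Once each site's preprocessing is independent, the multi-site parallel execution called for above becomes a small change. A minimal sketch using base R's `parallel` package, with `process_site()` as a hypothetical stand-in for a real preprocessing step:

```r
# Minimal sketch of parallel preprocessing across sites, assuming each site
# can be processed independently once dependencies are explicit.
library(parallel)

process_site <- function(site_id) {
  # placeholder: in PEcAn this would download, standardize, and convert met data
  data.frame(site = site_id, status = "processed")
}

sites <- c("US-NR1", "US-Ha1", "US-MMS")

# forked workers on unix; mclapply falls back to serial when mc.cores = 1
n_cores <- if (.Platform$OS.type == "unix") 2L else 1L
results <- mclapply(sites, process_site, mc.cores = n_cores)
combined <- do.call(rbind, results)
```

The same pattern extends to ensemble members: once steps take explicit inputs and return explicit outputs, swapping `mclapply` for an HPC-backed backend is a local change rather than a rewrite.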

**Prerequisites:**

- Required: R (existing workflow and prototype is in R)
- Helpful: familiarity with parallel computing and workflow refactoring

**Contact person:**

David LeBauer (@dlebauer), @Henry Priest

**Duration:**

Large (350 hr)

**Difficulty:**

High

---

### 2. Benchmarking and Validation Framework{#validation}

A key task in any modeling workflow is the validation of model outputs against held-out observations. When a validation dataset is used repeatedly, and is agreed upon by a broad community to have particular value in assessing model performance, it is often elevated to the status of a persistent "benchmark" dataset. In PEcAn, there is a need to replace our earlier benchmarking module, whose design was never fully implemented, with a simpler framework. In designing this framework, we'd encourage participants to build upon the low-level infrastructure in the existing benchmarking module for model-data alignment tools and comparison metrics such as RMSE, MAE, and R². Work should also build upon and generalize existing examples of "one-off" validation scripts (e.g., CARB cropland validations, North American data assimilation validations).
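For reference, the comparison metrics named above are straightforward to express in R; these are illustrative one-liners to show the expected behavior, not the benchmarking module's actual implementations:

```r
# Illustrative validation metrics: root mean square error, mean absolute
# error, and coefficient of determination.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# toy observed vs. predicted values
obs  <- c(2.0, 3.5, 4.0, 5.5)
pred <- c(2.1, 3.2, 4.3, 5.0)

round(c(RMSE = rmse(obs, pred), MAE = mae(obs, pred), R2 = r2(obs, pred)), 3)
# RMSE 0.332, MAE 0.300, R2 0.930
```

A validation framework mostly adds the harder part around these metrics: aligning model output and observations in time, space, and units before they are compared.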

**Expected outcomes:**

A successful project would complete the following tasks:

* A high-level design and plan for development that will continue beyond what is accomplished this summer
* Unit and integration tests
* A generalized example of a validation workflow and/or notebook using California cropland datasets spanning multiple sites and crop types
* Documentation

**Prerequisites:**

- Required: R (existing workflow and prototype is in R), familiarity with statistical methods for model validation
- Helpful: familiarity with existing benchmarking workflow systems

**Contact person:**

Chris Black (@infotroph)

**Duration:**

Flexible to work as either a Medium (175hr) or Large (350 hr)

**Difficulty:**

Medium

---
### 3. Increase PEcAn modularity{#module}

Existing PEcAn workflows rely heavily on reading a large `settings` object and writing .RData files or other opaque artifacts to disk to pass state between steps. This behavior reduces transparency, testability, and user understanding. The high-level goal of this project is to make PEcAn's core functionality more modular and transparent, so that users can more easily build, maintain, and expand PEcAn workflows.

This project refactors a single, well-defined workflow so that functions return explicit R objects (e.g., data frames or lists) instead of relying on hidden on-disk side effects.

To minimize disruption to existing workflows, the preferred approach would be:

* Begin by documenting existing functionality
* Where needed, write tests for existing functionality
* Document new functionality
* Write tests for new functionality (TDD)
* Refactor functions to return objects
* Then refactor downstream functions to use those objects
* Only after that is working, stop writing out the files
* If time permits, analyze how PEcAn's high-level modules use the `settings` object and, where possible, refactor function inputs to pass only the required subset of variables or variable lists
* Along the way, reassess which functions need to be exported; fewer exported functions would make it easier for new users to see what PEcAn's core modules actually are, and better document the core functions and modules we expect users to need to learn and use
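The refactoring pattern in the steps above can be sketched as follows; `run_step()` and `run_step_legacy()` are hypothetical names, not existing PEcAn functions:

```r
# Sketch: a step that once wrote .RData as a side effect instead returns an
# explicit, documented R object, with a thin backward-compatible wrapper
# preserving the legacy on-disk behavior for existing workflows.
run_step <- function(settings_subset) {
  # return a documented object rather than writing hidden state to disk
  list(
    params = data.frame(pft = "temperate.coniferous", SLA = 8.2),
    meta   = list(created = Sys.time())
  )
}

# deprecated wrapper: keeps the old .RData output until callers migrate
run_step_legacy <- function(settings_subset, output_dir) {
  result <- run_step(settings_subset)
  save(result, file = file.path(output_dir, "step.RData"))
  invisible(result)
}
```

Because the wrapper delegates to the pure function, downstream code can migrate one caller at a time, and unit tests can exercise `run_step()` without touching the filesystem.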

**Expected outcomes**:

* Refactored functions that return explicit R objects instead of writing .RData
* Clear definition and documentation of object structures passed between steps
* Backward-compatible wrappers where needed to avoid breaking existing workflows
* Unit tests that no longer depend on on-disk state or `output_dir`
* Documentation describing .RData deprecation, migration guidance, and examples

**Skills Required**:

- Required: R (existing workflow and prototype is in R) and R package development
- Helpful: familiarity with code refactoring

**Contact person:**

Mike Dietze (@Dietze)

**Duration:**

Suitable for a Medium (175hr) or Large (350 hr) project.

**Difficulty:**

Medium

---
### 4. Standardizing Model Couplers Across Models{#couplertools}

PEcAn models frequently duplicate similar logic for writing configuration files, translating meteorological inputs, and handling model-specific I/O. This copy-paste pattern increases maintenance cost and makes it harder to integrate new models consistently.

This project identifies a small set of shared configuration and I/O patterns and refactors them into documented helper functions with well-defined interfaces. Possible examples include netCDF reading/writing, parsing standardized input files, test fixtures, settings validation, and others. The approach should be demonstrated across a limited number of model coupler packages under active development.
**Expected outcomes:**

A successful project will deliver an inventory of duplicated configuration and I/O patterns, along with one or more of the following steps toward deduplication:

* Shared helper functions with explicit inputs, outputs, and unit conventions
* Refactored model code using standardized helpers
* Unit tests ensuring consistent behavior across models
* Updated developer documentation describing standard interfaces and recommended usage
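As one example of the kind of shared helper envisioned, a unit-aware conversion function with an explicit interface might look like this; the name and signature are hypothetical, not an existing PEcAn interface:

```r
# Sketch of a small shared helper with explicit inputs, outputs, and unit
# conventions that multiple model couplers could reuse instead of each
# hard-coding its own temperature handling.
convert_temperature <- function(x, from = c("K", "degC"), to = c("K", "degC")) {
  from <- match.arg(from)
  to   <- match.arg(to)
  if (from == to) return(x)
  if (from == "K") x - 273.15 else x + 273.15  # K <-> degC offset
}
```

Centralizing even trivial conversions like this pays off because the helper documents the convention once, and every coupler that uses it inherits the same tested behavior.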
**Prerequisites:**

- Required: proficiency in R
- Helpful: experience with unit testing

**Contact person:**

Chris Black (@infotroph)

**Duration:**

Medium (175hr) or Large (350 hr), depending on number of deliverables

**Difficulty:**

Medium
<!--

# This comment section is for ideas that may be potentially viable in future (with revision)

---

### 4. Development of Notebook-based PEcAn Workflows{#notebook}

The PEcAn workflow is currently run using either a web-based user interface, an API, or custom R scripts. The web-based user interface is easiest to use but has limited functionality, whereas the custom R scripts and API are more flexible but require more experience.

This project will focus on building Quarto notebooks that provide an interface to PEcAn that is both welcoming to new users and flexible enough to be a starting point for more advanced users. It will build on existing [Pull Request 1733](https://github.com/PecanProject/pecan/pull/1733).

**Expected outcomes:**

- Two or more template workflows for running the PEcAn workflow.
- Written vignette and video tutorial introducing their use.

**Prerequisites:**

- Familiarity with R.
- Familiarity with RStudio and Quarto or Rmarkdown is a plus.

**Contact person:**
David LeBauer @dlebauer, Nihar Sanda @koolgax99

**Duration:**
Medium (175hr)

**Difficulty:**
Medium

#### BETYdb R data package
