Question about calkit with multiple subprojects #359

rebeccamccabe · 2025-06-11T14:55:14Z

rebeccamccabe
Jun 11, 2025
Collaborator

What is the intended use of Calkit for multiple projects with partially overlapping dependencies? For example, a dissertation.tex that only I need access to, which contains project_A.tex, project_B.tex, and project_C.tex, each of which has its own GitHub repo and Overleaf doc with distinct collaborators, but where project_A uses some code from project_B via a submodule?

petebachant · 2025-06-11T17:17:18Z

petebachant
Jun 11, 2025
Maintainer

That's a good question. As of now, no subproject functionality has been built, though I started to try to make my own PhD work into a Calkit project. I would lean towards using a monorepo with no submodules, but during my PhD I created a new repo for each experiment, simulation campaign, paper, and one for the dissertation, so that seems to lend itself to subprojects with submodules. They all had overlapping dependencies as you say, and I did some wacky (non-reproducible) stuff to copy things back and forth when something was modified.

The overarching philosophy is that everything should be able to be reproduced with a single command. I could imagine something like this that uses submodules:

# calkit.yaml for the "super project"
name: my-phd
title: Investigating a thing
subprojects:
  - path: project-a
  - path: project-b
pipeline:
  stages:
    copy-files-from-project-a:
      kind: shell-command
      command: cp project-a/somefile.tex project-a.tex
      inputs:
        - project-a/somefile.tex
      outputs:
        - path: project-a.tex
          storage: null # Keep out of version control here since it's already in project A
    copy-files-from-project-b:
      kind: shell-command
      command: cp project-b/somefile.tex project-b.tex
      inputs:
        - project-b/somefile.tex
      outputs:
        - path: project-b.tex
          storage: null
    build-thesis:
      kind: latex
      tex_file_path: thesis.tex
      inputs:
        - from_stage_outputs: copy-files-from-project-a
        - from_stage_outputs: copy-files-from-project-b

When calkit run is called, the pipelines in all subprojects are run (may need to disallow dependencies between subprojects, i.e., only allow the super project to depend on the subprojects), and then the super project is run, so everything is guaranteed to be consistent.
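As a rough illustration of that run order, here is a minimal Python sketch. It is hypothetical throughout: Calkit has no subproject support yet, `plan_runs` is an invented name, and the rule that a subproject input may not escape its own directory is only one possible way to "disallow dependencies between subprojects." The dicts stand in for parsed calkit.yaml files shaped like the example above.

```python
import posixpath


def plan_runs(super_project: dict, subproject_configs: dict) -> list:
    """Return the order in which pipelines would run.

    Hypothetical sketch, not Calkit's implementation. Subprojects run
    first, in the order listed; the super project (".") runs last.
    Raises ValueError if a subproject stage input points outside the
    subproject's own directory (e.g., into a sibling subproject).
    """
    order = []
    for sub in super_project.get("subprojects", []):
        path = sub["path"]
        config = subproject_configs.get(path, {})
        stages = config.get("pipeline", {}).get("stages", {})
        for stage_name, stage in stages.items():
            for inp in stage.get("inputs", []):
                if not isinstance(inp, str):
                    continue  # Skip non-path inputs such as from_stage_outputs
                norm = posixpath.normpath(inp)
                if norm == ".." or norm.startswith("../"):
                    raise ValueError(
                        f"{path}:{stage_name} depends on '{inp}', "
                        "which is outside the subproject"
                    )
        order.append(path)
    order.append(".")  # The super project always runs after its subprojects
    return order


super_project = {
    "name": "my-phd",
    "subprojects": [{"path": "project-a"}, {"path": "project-b"}],
}
subproject_configs = {
    "project-a": {"pipeline": {"stages": {"build": {"inputs": ["paper.tex"]}}}},
    "project-b": {},
}
print(plan_runs(super_project, subproject_configs))
```

Only the super project is allowed to read from the subproject directories here, which keeps the run order a simple two-level sequence instead of a general dependency graph.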

What do you think? Do you have an ideal workflow in mind? Are project A and B done now, so all you need to work on is the dissertation, but you'd like to collect up all of the publications, code, figures, data, etc., into one project so they can be consumed by others in one place, sort of like a bundle of all of your PhD research? Or are you mostly concerned with syncing text from the papers into the dissertation?

If you feel like sharing the projects, I'd be happy to take a look and see if I can build something that helps make things easier.


rebeccamccabe · 2025-06-16T14:41:13Z

rebeccamccabe
Jun 16, 2025
Collaborator Author

In my case:

project_A is https://github.com/symbiotic-engineering/MDOcean (MATLAB, both model and optimization)
project_B is https://github.com/symbiotic-engineering/OpenFLASH (Python and MATLAB, just model)
project_C is basically rerunning the MATLAB optimization of project_A but using a new data input. The data input is obtained by running a separate optimization in a third repo, which is in development and not public yet (mainly Python, a bit of Julia, using several external packages). I plan for the figures and text of project_C to live in the MDOcean repo, but the code for the sub-optimization that produces the data will probably stay in the separate third repo, because people would likely want to run that code separately from the outer optimization.

The submodule of project B inside project A isn't set up yet: currently MDOcean/mdocean/simulation/modules/MEEM and OpenFLASH/hydro/matlab contain duplicated code that is a bit out of sync, but I plan to fix this soon. We definitely plan to keep projects A and B in separate repos, because OpenFLASH is a toolbox intended for public use that we recently released a Python package for, whereas MDOcean is currently closer to a one-off study, though I intend to get it to a place where it's usable by others too.

One thing that I suspect would be important for reproducibility is to allow different super projects to point to different commits of the subprojects, since the subprojects change over time. Someone trying to reproduce an optimization paper would presumably like both to reproduce exactly what was published (i.e., using model v1) and to rerun the optimization using model v2 if the model has since been enhanced. The dissertation might also use slightly different versions of the text than the publications for each project do. That seems very difficult with a monorepo but easy with submodules. It also seems to suggest that perhaps the MDOcean model and optimization should be in separate repos, which is not currently the case.
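To make the pinning idea concrete, here is a small Python sketch that reads which commit each submodule is checked out at, based on the output format of `git submodule status`. The function names, the sample path `modules/OpenFLASH`, and the placeholder hashes are all made up for illustration, and the parser assumes submodule paths contain no spaces.

```python
import subprocess

# Prefix characters documented for `git submodule status`
STATES = {
    " ": "at pinned commit",
    "-": "uninitialized",
    "+": "differs from pinned commit",
    "U": "merge conflict",
}


def parse_submodule_status(text: str) -> dict:
    """Map each submodule path to its checked-out commit and state.

    Illustrative helper; assumes paths without spaces.
    """
    pins = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        flag, rest = line[0], line[1:]
        parts = rest.split()
        commit, path = parts[0], parts[1]
        pins[path] = {"commit": commit, "state": STATES.get(flag, "unknown")}
    return pins


def read_submodule_pins(repo_dir: str = ".") -> dict:
    """Run `git submodule status` in repo_dir and parse the result."""
    result = subprocess.run(
        ["git", "submodule", "status"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_submodule_status(result.stdout)


# Sample output with placeholder hashes and a hypothetical submodule path
sample = (
    " " + "a" * 40 + " modules/OpenFLASH (v1.0)\n"
    "-" + "b" * 40 + " modules/other\n"
)
for path, info in parse_submodule_status(sample).items():
    print(path, info["commit"][:7], info["state"])
```

Two super projects, such as a paper repo and the dissertation repo, could each record a different commit for the same subproject this way, and a leading `-` would reveal a submodule that has not been checked out before a run.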

All the projects are still changing. The consolidation of dissertation writing will start in a few months, and my goal would be to sync figures and data as well as text. Thanks for your thoughts.


rebeccamccabe · 2025-06-16T23:55:35Z

rebeccamccabe
Jun 16, 2025
Collaborator Author

I set up the submodule (symbiotic-engineering/MDOcean#97) and updated the Calkit and DVC YAML files with the submodule paths (without the file copying you suggested). Then calkit run failed on the first attempt, but worked when I reran it after making no changes. It seems like copying the files is not necessary as long as the submodules are properly checked out, but I might be misunderstanding your workflow suggestion.

$ calkit run
Checking system-level dependencies
Running stage 'build-paper':
> calkit xenv -n tex -- latexmk -cd -silent -interaction=nonstopmode -pdf paper/mdocean.tex
Rc files read:
  NONE
Latexmk: Run number 1 of rule 'bibtex mdocean'
I found no \citation commands---while reading file mdocean.aux
I found no \bibdata command---while reading file mdocean.aux
I found no \bibstyle command---while reading file mdocean.aux
(There were 3 error messages)
Collected error summary (may duplicate other messages):
  bibtex mdocean: Bibtex errors: See file 'mdocean.blg'

Latexmk: Sometimes, the -f option can be used to get latexmk
  to try to force complete processing.
  But normally, you will need to correct the file(s) that caused the
  error, and then rerun latexmk.
  In some cases, it is best to clean out generated files before rerunning
  latexmk after you've corrected the files.
Error: Failed to run in Docker environment
ERROR: failed to reproduce 'build-paper': failed to run: calkit xenv -n tex -- latexmk -cd -silent -interaction=nonstopmode -pdf paper/mdocean.tex, exited with 1
Error: DVC pipeline failed

Running again:

$ calkit run
Checking system-level dependencies
Running stage 'build-paper':
> calkit xenv -n tex -- latexmk -cd -silent -interaction=nonstopmode -pdf paper/mdocean.tex
Rc files read:
  NONE
Latexmk: Run number 1 of rule 'pdflatex'
This is pdfTeX, Version 3.141592653-2.6-1.40.27 (TeX Live 2025) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
Latexmk: Getting log file 'mdocean.log'
Latexmk: Run number 1 of rule 'bibtex mdocean'
Warning--I'm ignoring mccabe_open-source_2024's extra "doi" field
--line 1383 of file references.bib
Warning--I didn't find a database entry for "PBE"
Warning--I didn't find a database entry for "agteMDOAssessmentDirection2009"
Warning--I didn't find a database entry for "martinsMultidisciplinaryDesignOptimization2013"
Warning--to sort, need author or key in noauthor_system_2020
Warning--empty booktitle in chau_inertia_2012
Warning--empty booktitle in mccabe_multidisciplinary_2022
Warning--empty pages in mccabe_multidisciplinary_2022
Warning--empty journal in herber_dynamic_2014
Warning--empty pages in anderson_re-imagining_2024
Warning--empty booktitle in chau_inertia_2013
Warning--empty booktitle in mccabe_open-source_2024
Warning--empty note in mccabe_investigating_2025
Warning--empty note in khanal_openflash_2025
Warning--empty booktitle in philip_damping_2012
Warning--empty pages in philip_damping_2012
Warning--empty booktitle in gaebele_incorporating_2023
Warning--empty pages in gaebele_incorporating_2023
Warning--empty booktitle in mccabe_system_2023
Warning--empty pages in mccabe_system_2023
Warning--empty booktitle in chau_inertia_2010
Warning--empty pages in chau_inertia_2010
(There were 22 warnings)
Latexmk: Run number 2 of rule 'pdflatex'
This is pdfTeX, Version 3.141592653-2.6-1.40.27 (TeX Live 2025) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
Latexmk: Getting log file 'mdocean.log'
Latexmk: Run number 2 of rule 'bibtex mdocean'
Warning--I'm ignoring mccabe_open-source_2024's extra "doi" field
--line 1383 of file references.bib
Warning--I didn't find a database entry for "PBE"
Warning--I didn't find a database entry for "agteMDOAssessmentDirection2009"
Warning--I didn't find a database entry for "martinsMultidisciplinaryDesignOptimization2013"
Warning--to sort, need author or key in noauthor_system_2020
Warning--empty booktitle in chau_inertia_2012
Warning--empty booktitle in mccabe_multidisciplinary_2022
Warning--empty pages in mccabe_multidisciplinary_2022
Warning--empty journal in herber_dynamic_2014
Warning--empty pages in anderson_re-imagining_2024
Warning--empty booktitle in chau_inertia_2013
Warning--empty booktitle in mccabe_open-source_2024
Warning--empty note in mccabe_investigating_2025
Warning--empty note in khanal_openflash_2025
Warning--empty booktitle in philip_damping_2012
Warning--empty pages in philip_damping_2012
Warning--empty booktitle in gaebele_incorporating_2023
Warning--empty pages in gaebele_incorporating_2023
Warning--empty booktitle in mccabe_system_2023
Warning--empty pages in mccabe_system_2023
Warning--empty booktitle in chau_inertia_2010
Warning--empty pages in chau_inertia_2010
(There were 22 warnings)
Latexmk: Run number 3 of rule 'pdflatex'
This is pdfTeX, Version 3.141592653-2.6-1.40.27 (TeX Live 2025) (preloaded format=pdflatex)
 restricted \write18 enabled.
entering extended mode
Latexmk: Getting log file 'mdocean.log'
Latexmk: Summary of warnings from last run of *latex:
  Latex failed to resolve 4 reference(s)
  Latex found 1 multiply defined reference(s)
  Latex failed to resolve 3 citation(s)
Latexmk: ====Undefined refs and citations with line #s in .tex file:
  Label `standard' multiply defined
  Citation `PBE' on page 10 undefined on input line 9
  Citation `agteMDOAssessmentDirection2009' on page 22 undefined on input line 314
  Citation `martinsMultidisciplinaryDesignOptimization2013' on page 22 undefined on input line 315
  Reference `eq:LCOE' on page 63 undefined on input line 19
  Reference `eq:spar' on page 64 undefined on input line 75
  Reference `eq:spar' on page 64 undefined on input line 76
 And 1 more --- see log file 'mdocean.log'

Updating lock file 'dvc.lock'

To track the changes with git, run:

        git add dvc.lock

To enable auto staging, run:

        dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.
