Are you working reproducibly? Why or why not? #227

petebachant · 2025-01-29T14:02:53Z

Maintainer

Working reproducibly can be loosely defined as:

Putting all files in version control.
Generating all artifacts with a pipeline where all steps are run in a clearly-defined computational environment.

What's your current workflow like?

rodrigo-pena · 2025-07-17T15:35:54Z

Hi there, just stumbled upon calkit. It's a very interesting endeavor and I'll be following its development!

I have been reviewing the workflows in my projects this year, trying to adopt best practices from the DevOps, MLOps and DataOps worlds. So far, everything that is strictly software related has been a breeze to implement (e.g. automatic checks, tests, doc-building on CI/CD). However, I struggle to find good solutions for maintaining and automating my data pipelines, especially because I work in an HPC environment.

Each of my projects will typically have a dataset living somewhere in the cluster filesystem. It's not always clear if the data will always live there, and therefore have restricted access, or if it can be packaged and published (e.g. to Zenodo) together with a publication. On top of that, a lot of the steps in my pipeline need to be scheduled to compute nodes via Slurm (so a CI/CD runner wouldn't have access to running these steps).

I generally like the approach of tools like DVC but I haven't yet figured out how to make it work for my setup. In the forums, interfacing with Slurm systems seems to be a current pain point (e.g. treeverse/dvc#1057). And ideally I would like to avoid duplication of storage from dvc pull, since the "remote" lives in the cluster itself and users are charged for storage. So far, DVC's External Data doesn't seem to work (or at least I didn't manage to make it work).

Do you have any thoughts about this use case?

3 replies

petebachant Jul 17, 2025
Maintainer Author

Hi there @rodrigo-pena,

I do have some thoughts on the topic and have built a few experimental features towards a solution. Calkit's pipeline (which compiles to a DVC pipeline) has the concept of an SSH environment, which can be used to represent a cluster or any other remote machine. It doesn't currently pay attention to the Slurm scheduler, but it does deal with syncing files back and forth and allows for disconnecting and resuming the stage to wait for it to complete on the remote machine. The "UX" is something like:

# In calkit.yaml
environments:
  my-cluster:
    kind: ssh
    host: "10.225.22.25"
    user: my-user-name
    wdir: /home/my-user-name/my-project-folder
    key: ~/.ssh/id_ed25519
    send_paths:
      - script.sh
    get_paths:
      - results
pipeline:
  stages:
    run-the-job:
      kind: shell-script
      script_path: script.sh
      environment: my-cluster
      outputs:
        - results

In this setup, the pipeline would be executed from your local machine, not the cluster, so the DVC cache will also live locally.
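To make the send/run/get flow concrete, here is a minimal Python sketch of the commands a local runner could issue for an SSH environment like the one above. This is an illustration only and not how Calkit implements it: the function name `build_ssh_stage_commands` is hypothetical, and the use of plain `scp` and `ssh` is an assumption.

```python
import os
import shlex


def build_ssh_stage_commands(env, script_path):
    """Return the argv lists a local runner might execute, in order.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    target = f"{env['user']}@{env['host']}"
    key_args = ["-i", os.path.expanduser(env["key"])]
    commands = []
    # 1. Copy each path listed under send_paths to the remote working directory.
    for path in env.get("send_paths", []):
        commands.append(["scp", "-r", *key_args, path, f"{target}:{env['wdir']}/"])
    # 2. Run the script on the remote machine from inside the working directory.
    remote_cmd = f"cd {shlex.quote(env['wdir'])} && bash {shlex.quote(script_path)}"
    commands.append(["ssh", *key_args, target, remote_cmd])
    # 3. Copy each path listed under get_paths back into the local project folder.
    for path in env.get("get_paths", []):
        commands.append(["scp", "-r", *key_args, f"{target}:{env['wdir']}/{path}", "."])
    return commands
```

Each returned list could be passed to `subprocess.run`. Because the outputs are copied back to the machine that ran the pipeline, the DVC cache ends up on that machine as well.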

What do you think? Would this make automating your projects easier? Are you able to provide any examples of your data pipelines?

rodrigo-pena Jul 18, 2025

An example pipeline from a recent project is the following. I have two SLURM scripts, make_dataset_from_psql_dump.sh and pdf2md.sh. They are both queued from a login node in the HPC cluster via sbatch. E.g. sbatch make_dataset_from_psql_dump.sh

The first script converts a Postgres dump into a "dataset" of CSV tables and PDFs:

.
├── tables
├── pdfs

The second converts the PDFs into Markdown with embedded images (by internally calling a Python script, pdf2md.py), yielding the directory structure:

.
├── tables
├── pdfs
├── md

If I understand your suggestion, I would have to create two new bash scripts, say step_1.sh and step_2.sh that contain simply the sbatch <slurm_script> commands, add these to the calkit pipeline, and then run the pipeline from my local machine (i.e. not from the login node on the cluster)?

I checked the docs on SSH environments, and you mention "It is assumed that dependencies on the remote machine are managed separately." So far, my workflow has been to use a GitLab repository to "sync" code between my development machine and the cluster (log in to the cluster, pull the changes, then execute code). The same environment specs are managed on both machines via uv. With calkit's SSH environments, I would then have to log in to the HPC cluster, do git pull/uv sync to make sure the latest environment specs are installed, then go back to my local machine to launch pipeline jobs?

petebachant Jul 18, 2025
Maintainer Author

That would be the workflow in the current implementation, but I have thought a bit about a use case similar to yours, where the project repo lives on the cluster. That's how I used to do my work when I was often on an HPC system (before DVC existed), so pulling, compiling the code, and copying the results back to my local machine was a manual process.

I do see the value in having the project repo exist out on the cluster, using project-defined environments out there, and think that's a use case we should support. What is your ideal workflow? Do you want to be able to run your pipeline on the cluster via GitLab CI/CD, do you want to kick off the pipeline from your local machine and have certain steps run on the cluster, or do you want to log in to the cluster, run the pipeline there, and manually commit/push results back out?

Appendix: Some design brainstorming

One thing I could imagine is that an SSH environment has some properties that define if the project lives out there, and then as part of "checking" that environment we pull the repo (or fetch and check out the current version of the local repo) and check the uv environment (uv sync).

environments:
  my-cluster-uv:
    kind: ssh-slurm
    project_cloned: true
    sub_environment: my-uv?
pipeline:
  stages:
    run-step-1:
      kind: python-script
      environment: my-cluster-uv # Automatically detects if we're running from a login node?

Need to figure out if the Slurm aspect is a property of the pipeline stage or the environment itself, or maybe we're talking about a different category here, like "machines".

machines:
  my-cluster:
    host:
    username:
    ssh_key:
    kind: slurm
    sync_repo: true
    wdir: $HOME/my-project
environments:
  my-uv:
    path: pyproject.toml
pipeline:
  stages:
    step1:
      kind: python-script
      script_path: step1.py
      environment: my-uv
      machine: my-cluster

If you wanted to run locally, you could call calkit run --ignore-machine. Related to #314.

`sbatch` stage type

We could have an sbatch Calkit stage type that actually becomes two DVC stages: one that submits the job and saves a file to identify that job, and another that waits for its completion. This would allow disconnecting and resuming the pipeline later from the login node.
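A rough Python sketch of that two-stage idea follows. It is not an actual Calkit implementation: the function names, the job ID file, and the polling interval are all hypothetical. It assumes `sbatch` prints its usual "Submitted batch job <id>" line and that `squeue -h -j <id>` prints nothing once the job has left the queue.

```python
import re
import subprocess
import time
from pathlib import Path


def submit_job(script, job_file, run=subprocess.run):
    """Stage 1 (hypothetical): submit the script and record the job ID in a file."""
    result = run(["sbatch", script], capture_output=True, text=True, check=True)
    # sbatch normally reports "Submitted batch job <id>" on success.
    match = re.search(r"Submitted batch job (\d+)", result.stdout)
    if match is None:
        raise RuntimeError(f"Could not parse a job ID from: {result.stdout!r}")
    Path(job_file).write_text(match.group(1) + "\n")
    return match.group(1)


def wait_for_job(job_file, poll_seconds=30, run=subprocess.run, sleep=time.sleep):
    """Stage 2 (hypothetical): block until the recorded job has left the queue."""
    job_id = Path(job_file).read_text().strip()
    while True:
        # With -h (no header), squeue prints one line per pending or running job.
        result = run(["squeue", "-h", "-j", job_id], capture_output=True, text=True)
        if not result.stdout.strip():
            return job_id
        sleep(poll_seconds)
```

Because the job ID lives in a file, the waiting stage can be interrupted and rerun later without resubmitting. This sketch does not distinguish a successful job from a failed one; checking the final state would need something like `sacct`.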

rodrigo-pena · 2025-07-18T19:10:16Z

Re. ideal workflow: I don't know if I could trigger the pipelines from GitLab CI/CD because the runner is not on the cluster and does not have access to the compute nodes. And even if I could, some jobs are really long running (3 days), so I would prefer to trigger them manually. As for whether to trigger jobs from my local machine or from the cluster, I don't think there's much difference for me. I wouldn't mind having to log in to the cluster to run the pipelines.

From your design brainstorming, I like the idea of the "machines" category, because I could imagine projects where I would have access to some cloud computing service, or run things in another HPC center. The sbatch stage type also sounds good.

3 replies

petebachant Jul 28, 2025
Maintainer Author

Makes sense. I'm working on a project now that runs some stages on a remote machine, and it seems like a machines category fits. I'm currently manually syncing data back and forth with Git/DVC, but I'm considering whether scp or rsync would be better when running from my local machine.

petebachant Sep 15, 2025
Maintainer Author

Calkit v0.30.0, released today, now includes a slurm environment and sbatch stage type. It does require calling calkit run while logged in to the cluster, and that the repo is cloned there. The pipeline stages run synchronously, waiting for the jobs to complete, but the process is robust to disconnects, i.e., it will resume waiting for a job if the pipeline is rerun and one was already submitted. Jobs that are now invalid due to modified dependencies will also be cancelled and resubmitted.
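The resume-or-resubmit behavior can be pictured with a small Python sketch. This is an assumption-laden illustration, not Calkit's code: the record file layout (`job_id`, `deps_hash`) and both function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def hash_dependencies(paths):
    """Combine the names and contents of all dependency files into one digest."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(path.encode())
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def decide_action(record_file, deps_hash):
    """Return 'submit', 'resume', or 'resubmit' for the current pipeline run."""
    record_path = Path(record_file)
    if not record_path.exists():
        return "submit"  # no job has been submitted for this stage yet
    record = json.loads(record_path.read_text())
    if record["deps_hash"] == deps_hash:
        return "resume"  # same inputs: keep waiting on record["job_id"]
    # Inputs changed: the caller would cancel record["job_id"], then submit again.
    return "resubmit"
```

A real implementation would also need to handle jobs that have already finished or failed, which this sketch leaves out.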

See the docs here: https://docs.calkit.org/pipeline/slurm/

And an example project here: https://github.com/petebachant/clima-gpu-profiling

rodrigo-pena Sep 15, 2025

@petebachant That's excellent news! I'm gonna try it out here
