Merge request approval workflow for data versioning #9907
-
@pirnerjonas can we please explore this a bit more before we discuss some possible solutions? I'm curious what we even mean by "update the data".
WDYT?
-
@shcheklein yes, certainly.
We use Amazon S3 as our dvc remote.
If I understand correctly, the metrics presented in this MR / PR would then reflect what has already changed in the dataset, so a project that depends on this data could already be affected before any review happens. So in general we want to establish some kind of review process (e.g. via MR / PR) for a new dataset version we are about to create, because it will be used in production by downstream applications. Hope this makes it a little bit clearer what we are trying to get at.
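For context, the kind of downstream consumption I mean could look something like this (a sketch only; the repo URL, path, and tag are placeholders, not our real setup):

```bash
# a downstream application pinning a specific, reviewed dataset version
# (URL, path, and --rev value are placeholders)
dvc import https://gitlab.example.com/org/data-repo data/train.csv --rev v1.2.0
```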
-
I also thought about this two-remote setup before. I tried to implement this approach, however I currently have some problems with it. I registered both remotes in `.dvc/config`, roughly as sketched below.
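A sketch of how the two remotes can be registered (the S3 URLs are placeholders; the remote names match the CI job below):

```bash
# production remote is the default; the dev remote receives
# not-yet-reviewed pushes (s3 URLs are placeholders)
dvc remote add --default remote-prod s3://my-bucket/prod
dvc remote add remote-dev s3://my-bucket/dev
```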
I went ahead and pushed some of my local files to the dev remote with `dvc push --remote remote-dev`. My `.gitlab-ci.yml` job that is supposed to promote the data to the production remote on the default branch looks like this:

```yaml
data-push:
  stage: deploy
  image: ...
  before_script:
    - poetry install --only main
  script:
    - poetry run dvc pull --remote remote-dev -v
    - poetry run dvc push --remote remote-prod -v
  needs: [authenticate]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

Do you have an idea what I am doing wrong? Was this the setup you meant?
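In case it helps with debugging, a quick way to compare the local cache against each remote (standard `dvc status` flags, remote names as above):

```bash
# list tracked files that are missing from each remote
dvc status --cloud --remote remote-dev
dvc status --cloud --remote remote-prod
```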
-
Hi all,
I am wondering how you would implement a kind of merge request approval workflow for our dvc data repository.
Currently, if we want to update the data in our S3 backend, a developer does all the changes locally and pushes to the remote via `dvc push`. The developer can do something like a `dvc diff` or `dvc data status` to check the changes she is about to make locally, but we are currently thinking about how to implement something like this in a CI/CD pipeline (we are using GitLab). So ideally I would create a merge request that shows me the changes I am about to make to my data on the S3 backend, and only after I merge would the data be updated. The local flow today is sketched below.
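Roughly, the local workflow today (plain DVC commands, nothing project-specific):

```bash
# inspect pending dataset changes before publishing them
dvc data status   # workspace changes vs. the last committed .dvc files
dvc diff          # file-level diff against HEAD
dvc push          # upload the new version to the S3 remote
```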
What I tried
When I tried working on this, I realized that if I create a new dataset version locally via dvc (change some files, `dvc commit`, `git commit` the `.dvc` files, `git push`) and create a merge request, I can get the dataset diff between the two branches, e.g. like this:
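```bash
# a sketch; the branch names below are placeholders for ours
dvc diff main feature/new-dataset-version
```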
But if I understood it correctly, if I now want to do the `dvc push` in the CI/CD pipeline, I would need access to the `.dvc/cache` folder, which I don't have access to out of the box in my runner environment.
Maybe it is possible to share the cache between my machine and the runner via a shared cache, but since I am currently struggling with this, I thought I would raise the more general question about this setup here.
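If I read the docs right, DVC can point its cache at an external directory, which is the kind of sharing I have in mind. A minimal sketch, assuming some location that both my machine and the runner can reach:

```bash
# move the project cache to a shared location
# (/mnt/shared/dvc-cache is a placeholder path)
dvc cache dir /mnt/shared/dvc-cache
# re-link workspace files against the relocated cache
dvc checkout --relink
```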