Merge request approval workflow for data versioning #9907
-
@pirnerjonas can we please explore this a bit more before we discuss some possible solutions? I'm curious what we even mean by "update the data".
WDYT?
-
@shcheklein yes, certainly.
We use Amazon S3 as our dvc remote.
If I understand correctly, the metrics presented in this MR / PR would then reflect what has already changed in the dataset, so a project that depends on this data could already be affected before any review happens. So in general we want to establish some kind of review process (e.g. via MR / PR) for a new dataset version we are about to create, because it will be used in production by downstream applications. Hope this makes it a little bit clearer what we are trying to get at.
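For context, the kind of downstream consumption I mean could look something like this (a sketch only; the repo URL, path, and tag are placeholders, not our real setup):

```bash
# a downstream application pinning a specific, reviewed dataset version
# (URL, path, and --rev value are placeholders)
dvc import https://gitlab.example.com/org/data-repo data/train.csv --rev v1.2.0
```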
-
I also thought about this two-remote setup before. I tried to implement this approach, however I currently have some problems with it. I registered both remotes in `.dvc/config`, roughly as sketched below.
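A sketch of how the two remotes can be registered (the S3 URLs are placeholders; the remote names match the CI job below):

```bash
# production remote is the default; the dev remote receives
# not-yet-reviewed pushes (s3 URLs are placeholders)
dvc remote add --default remote-prod s3://my-bucket/prod
dvc remote add remote-dev s3://my-bucket/dev
```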
I went ahead and pushed some of my local files to the dev remote with `dvc push --remote remote-dev`. My `.gitlab-ci.yml` job that is supposed to promote the data to the production remote on the default branch looks like this:

```yaml
data-push:
  stage: deploy
  image: ...
  before_script:
    - poetry install --only main
  script:
    - poetry run dvc pull --remote remote-dev -v
    - poetry run dvc push --remote remote-prod -v
  needs: [authenticate]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
```

Do you have an idea what I am doing wrong? Was this the setup you meant?
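In case it helps with debugging, a quick way to compare the local cache against each remote (standard `dvc status` flags, remote names as above):

```bash
# list tracked files that are missing from each remote
dvc status --cloud --remote remote-dev
dvc status --cloud --remote remote-prod
```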
-
Hi all,
I am wondering how you would implement a kind of merge request approval workflow for our dvc data repository.
Currently, if we want to update the data in our S3 backend, a developer does all the changes locally and pushes to the remote via `dvc push`. The developer can do something like a `dvc diff` or `dvc data status` to check the changes she is about to make locally, but we are currently thinking about how to implement something like this in a CI/CD pipeline (we are using GitLab). So ideally I would create a merge request that shows me the changes I am about to make to my data on the S3 backend, and only after I merge would the data be updated. The local flow today is sketched below.
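Roughly, the local workflow today (plain DVC commands, nothing project-specific):

```bash
# inspect pending dataset changes before publishing them
dvc data status   # workspace changes vs. the last committed .dvc files
dvc diff          # file-level diff against HEAD
dvc push          # upload the new version to the S3 remote
```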
What I tried
When I tried working on this, I realized that if I create a new dataset version locally via dvc (change some files, `dvc commit`, `git commit` the `.dvc` files, `git push`) and create a merge request, I can get the dataset diff between the two branches, e.g. like this:
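```bash
# a sketch; the branch names below are placeholders for ours
dvc diff main feature/new-dataset-version
```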
But if I understood it correctly, if I now want to do the `dvc push` in the CI/CD pipeline, I would need access to the `.dvc/cache` folder, which I don't have access to out of the box in my runner environment.
Maybe it is possible to share the cache between my machine and the runner via a shared cache, but since I am currently struggling with this, I thought I would raise the more general question about this setup here.
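If I read the docs right, DVC can point its cache at an external directory, which is the kind of sharing I have in mind. A minimal sketch, assuming some location that both my machine and the runner can reach:

```bash
# move the project cache to a shared location
# (/mnt/shared/dvc-cache is a placeholder path)
dvc cache dir /mnt/shared/dvc-cache
# re-link workspace files against the relocated cache
dvc checkout --relink
```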