Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Strangeness with dataset.merge() #2015

Open
daniel-falk opened this issue Nov 20, 2022 · 5 comments
Open

[BUG] Strangeness with dataset.merge() #2015

daniel-falk opened this issue Nov 20, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@daniel-falk
Copy link
Contributor

🐛🐛 Bug Report

To start with, awesome work getting the merge functionality implemented! I have used it for a while now and discovered a few issues with it:

Text tensors are not merged

When two branches are merged, where both branches has the same tensors, some tensors are empty after merge. This applies to e.g. htype="text" and htype="generic". See example on colab here.

UI shows errors during a merge

If using the ActiveLoop user interface while two branches are merging the dataset fails to load and there are error messages "Dataset is corrupted or does not exist".

Exception about modification on read-only tensor during merge

While merging two branches I got this exception once. Rerunning with the same exact code a second time worked.

Error in sys.excepthook:                                                                                                                                                                                                                                  
Traceback (most recent call last):                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/humbug/report.py", line 498, in _hook                                                                                                           
    self.error_report(error=exception_instance, tags=tags, publish=publish)                                                                                                                                                                               
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/humbug/report.py", line 244, in error_report                                                                                                    
    traceback.format_exception(                                                                                                                                                                                                                           
TypeError: format_exception() got an unexpected keyword argument 'etype'                                                                                                                                                                                  
                                                                                                                                                                                                                                                          
Original exception was:                                                                                                                                                                                                                                   
Traceback (most recent call last):                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/diff_commits.py", line 84, in <module>                                                                                                                                            
    diff_commits()                                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 1130, in __call__                                                                                                          
    return self.main(*args, **kwargs)                                                                                                                                                                                                                     
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 1055, in main                                                                                                              
    rv = self.invoke(ctx)                                                                                                                                                                                                                                 
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 1404, in invoke                                                                                                            
    return ctx.invoke(self.callback, **ctx.params)                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/click/core.py", line 760, in invoke                                                                                                             
    return __callback(*args, **kwargs)                                                                                                                                                                                                                    
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/diff_commits.py", line 79, in diff_commits                                                                                                                                        
    ds_new.merge(f"debug4/diff-{hash1}-{hash2}-removed")                                                                                                                                                                                                  
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/humbug/report.py", line 445, in wrapped_callable                                                                                                
    return callable(*args, **kwargs)                                                                                                                                                                                                                      
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/invalid_view_op.py", line 22, in inner                                                                                            
    return callable(x, *args, **kwargs)                                                                                                                                                                                                                   
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/iteration_warning.py", line 11, in inner                                                                                          
    res = callable(x, *args, **kwargs)                                                                                                                                                                                                                    
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/dataset/dataset.py", line 1193, in merge                                                                                          
    merge(self, target_id, conflict_resolution, delete_removed_tensors, force)                                                                                                                                                                            
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/merge.py", line 59, in merge                                                                                                      
    merge_common_tensors(common_tensors, dataset, target_ds, nodes, conflict_resolution)                                                                                                                                                                  
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/merge.py", line 289, in merge_common_tensors                                                                                      
    merge_tensor_data(                                                                                                                                                                                                                                    
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/merge.py", line 472, in merge_tensor_data                                                                                         
    original_tensor.append(sample)                                                                                                                                                                                                                        
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/invalid_view_op.py", line 22, in inner                                                                                            
    return callable(x, *args, **kwargs)                                                                                                                                                                                                                   
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/tensor.py", line 377, in append                                                                                                   
    self.extend([sample], progressbar=False)                                                                                                                                                                                                              
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/util/invalid_view_op.py", line 22, in inner                                                                                            
    return callable(x, *args, **kwargs)                                                                                                                                                                                                                   
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/tensor.py", line 306, in extend                                                                                                   
    self._write_initialization()                                                                                                                                                                                                                          
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/tensor.py", line 259, in _write_initialization                                                                                    
    self.storage.check_readonly()                                                                                                                                                                                                                         
  File "/Users/danielfalk/testprgm/sandbox/deeplake/versioning_tutorial/venv/lib/python3.10/site-packages/deeplake/core/storage/provider.py", line 153, in check_readonly                                                                                 
    raise ReadOnlyModeError()                                                                                                                                                                                                                             
deeplake.util.exceptions.ReadOnlyModeError: Modification when in read-only mode is not supported!

Removed samples are not removed after a merge

Data samples that are removed on a branch does not get removed when that branch is merged.

This might be a design decision, that when merging it is the current state of the source branch that is merged, not the changes/deltas on the branch. This "union" behaviour feels very unnatural for a "merge" and is not how it works in e.g. git. I could not find any documentation explaining this behaviour.

See example on colab here.

Merged commits are not visible in history

One of the main advantages with version control of the data is that I can look in the log and see what different changes people has done to the dataset. This does not really work as is now since a merge only creates a "merged ..." commit message, it does not show the commits that was done to other branches. This means that if using trunk-based development where the changes are always done on a feature branch, committed with a self-explaining commit message and then merged with master, there will not be any useful commit messages on the main branch, only "merge ..." messages. I think we would really need something like gitk etc. that shows the different commits on the branches that are merged.

@daniel-falk daniel-falk added the bug Something isn't working label Nov 20, 2022
@istranic
Copy link
Contributor

Hi @daniel-falk thank you for raising the issue and for the reproduction script. We'll take a look at these issues asap.

@istranic
Copy link
Contributor

Hi @daniel-falk We had some internal miscommunication that led to the merge strangeness. Long-story-short, merging will bring back samples that were popped on other branches, and there are some technical limitation that resulted in this behavior.

Enabling merges to respect popped samples is not a problem if a merge happens once. However the technical problem occurs if a merge has already happened, and then another sample is popped on one of the branches, and a merge happens again. One way for us to move forward is to leave that behavior alone (merge respects popped samples at the first merge, but not afterwards), or we could go one step further and lock branches after they've been merged to main. Do you have any thoughts on which option is better from a user-perspective?

@daniel-falk
Copy link
Contributor Author

First, let me just make sure I understand what cases would work and what would not. In this image, number 1 and 3 would be OK, not 2 and 4?

DSC_1512

Number 2 could be solved by locking the branch. Once branch 2 is merged to master, we could lock branch 2 from accepting more commits.
DSC_1513

Number 4 is not as obvious how to block. When branch 2 is merged, there are already new commits in branch 3 that is based on branch 2 (we could technically get into this situation in the image above too, but not very likely). We could block during the merging instead, refusing to merge a branch based on another branch that is already merged. This would however be frustrating since you get the information after that you have done all the work. In this case you need to rebase the branch to the merge commit on main, but as is now this can be hours of (error prone) work since you need to redo each step manually, many of which are frustratingly slow like ds.pop().
DSC_1514

Do I understand it correctly that this specific situation would be problematic?
DSC_1515

Out of curiosity it would be interesting to know exactly what is the technical limitation. I can't see why it would not be enough to use the _{key}_id tensors to see what changes has been done on it since the first already merged ancestor. Is this technical limitation a problem in git too, or how have they solved it or why is it not a problem there?
DSC_1515_2

To answer your question, based on the information I have now, I would say:

  • Blocking branches from being merged more than once would be a bit frustrating, but better than not having the ability to remove objects by merging as it is now
  • Silently skipping the removal of objects removed in a branch where they have been removed before would be super frustrating, however if you can do it with a good user interaction, then it could be OK. It might e.g. be to add a new merge strategy union and raise an exception on merge of already merged branch unless the merge strategy is explicitly set to union.

Another thing to consider while on the topic of merges... Currently, as I understand it, it is only the ID vectors that are used for merging. One issue with this is when using the parallel compute functions, and another is when changing multiple tensors. Let's say that I have a tensor classes and I have another binary tensor is_good which is derived from classes. First I might consider only classes==1 as good:

ds.is_good.extend(ds.classes.numpy() == 1)

then I'll change my strategy to also consider classes==2 as good. The quick way to update the tensor is:

ds.is_good[:] = (ds.classes.numpy() == 1) | (ds.classes.numpy() == 2)

but this would change all the tensors, probably causing issues when merging. The way to get around this would be to do like this, but its a bit more complex and hard to remember to always to that way:

new_is_good = (ds.classes.numpy() == 1) | (ds.classes.numpy() == 2)
changed = new_is_good != ds.is_good.numpy()
ds.is_good[np.where(changed)[0].tolist()] = new_is_good[changed]

One way to solve this would be to use the hash of the values instead. This does of course not work with large objects since it would be slow and not with objects with lossy compression since they wont be bitwise exact after reencoding them. On the other hand, I would most probably not have this problem with large objects since they are too large to fit the full list of tensors in the memory so I would instead iterate over them and only change them if needed, and with objects that has lossy compression I would not like to do it this way anyhow since reencoding the same frame would degrade the image quality. So perhaps, one option would be to use hash-values for htypes like class_label?

@AbhinavTuli
Copy link
Contributor

Hey @daniel-falk.
You're correct case 2 and case 4 will not work.
The problem is that we had designed our version control 2 years ago with the following assumptions:-

  • The version control history will be stored as a tree i.e. each node will have only 1 parent.
  • Data won't be deleted

Further down the line we got requests for merge, which doesn't follow assumption 1. We got around this by making merge commits act as independent commits in which all the history is replayed.

With the introduction of pop, assumption 2 was also discarded. The way merge works is that we compare the ids of tensors in the commits and add the missing ids from target commit to the original commit. The issue with pop becomes that it is hard to differentiate between cases where an id was added on target vs when id was deleted on original and not on target.
We have plans to overhaul the entire version control system to better support merges in the near future.

Re hash values, we deliberately don't use them and prefer uuids instead for the reason that we want to keep track of samples and how they change over time. For example say a label was added with value 5 at index 10. The user then branches off to an alternate branch (alt1)and there after a few pops in the dataset it is at index 7. After that it is merged into another branch (alt2) which didn't have the sample at all, at index 25. The value is then changed from 5 to 1000.
Now if we merge alt2 into main, we need a unique id to understand that it is the same sample, although at a different index and with a different value, which is why hashes won't work.

@daniel-falk
Copy link
Contributor Author

I see, I was wrong above stating that this

ds.is_good[:] = (ds.classes.numpy() == 1) | (ds.classes.numpy() == 2)

would prevent correct merging. The IDs are never changed when writing to the tensor by index, it is only created when appending or extending and only goes away on a pop. Merging should thus be fine unless a compute function was used.

I understand the issue with pop and merge, I guess it is hard to avoid without using some kind of journal. Is the idea to use journals when overhauling the versioning? That would also open up for more advanced commands like rebasing a commit. It feels like if there should be a good library for dealing with journals, it's a common thing to do both in version control and in databases/filesystems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants