Lots of class C transactions #9

Open
Schnouki opened this issue Nov 14, 2016 · 11 comments

@Schnouki

This remote uses lots of class C transactions to the B2 API, which can be quite expensive. I think this is mostly due to the calls to ListFileNames() for each operation. Would it be possible to replace them with "simple" calls to GetFileInfo(), a class B operation?

Thanks a lot for your work!

@encryptio (Owner)

I think the original reason for doing that is that there was no other way to get all versions of the file, which was needed in some error recovery cases. That was only required on writes, so I think it could be improved during reads, if that's not already true.

Also, list_file_names used to be a class B operation iff maxFileCount was 1, which it is in this remote. That doesn't seem to be the case anymore...
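
For concreteness, here's a minimal sketch of the kind of presence check being discussed, written against the B2 native API and assuming an apiURL and authorization token already obtained from b2_authorize_account. The function name and signature are illustrative, not this remote's actual code:

```go
package sketch

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// CheckPresent asks B2 for at most one file name at or after the key's name.
// With maxFileCount == 1 this used to be billed as class B, which (per the
// comment above) no longer seems to be the case.
func CheckPresent(apiURL, authToken, bucketID, name string) (bool, error) {
	body, err := json.Marshal(map[string]interface{}{
		"bucketId":      bucketID,
		"startFileName": name,
		"maxFileCount":  1,
	})
	if err != nil {
		return false, err
	}
	req, err := http.NewRequest("POST", apiURL+"/b2api/v1/b2_list_file_names", bytes.NewReader(body))
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", authToken)
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("b2_list_file_names: %s", resp.Status)
	}
	var out struct {
		Files []struct {
			FileName string `json:"fileName"`
		} `json:"files"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return false, err
	}
	// Present iff the first name at or after startFileName matches exactly.
	return len(out.Files) > 0 && out.Files[0].FileName == name, nil
}
```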

I'll look into this more this weekend when I have some time.

@timsomers

Hi, I just started using this yesterday and already burned through the class C operations included in the free tier after only 2 megabytes of data. Have you found any time to look at this? Unfortunately I don't know golang at all.

@encryptio (Owner) commented Dec 3, 2016

It doesn't look like GetFileInfo is usable as-is, since it takes a B2 fileID, which B2 generates on the fly at upload time. It might be possible to store extra data in the git-annex branch with SETSTATE at upload time, but it seems like that'll run into interesting issues with respect to conflicting uploads of the same key (with different chunking values, for example). Solvable, but very non-trivial. I'll keep this issue open for that improvement.
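
To make the SETSTATE idea concrete, a rough sketch of how a remote could record a B2 fileID at upload time and read it back later via git-annex's external special remote protocol might look like this. The function names are hypothetical, and this deliberately ignores the conflicting-uploads problem described above:

```go
package sketch

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// The external special remote protocol is line-based: the remote writes
// requests to stdout and reads git-annex's replies from stdin.
var stdin = bufio.NewReader(os.Stdin)

// RememberFileID would store the B2 fileId for a key in the git-annex
// branch right after a successful upload. SETSTATE gets no reply.
func RememberFileID(key, fileID string) {
	fmt.Printf("SETSTATE %s %s\n", key, fileID)
}

// LookupFileID asks git-annex for the stored fileId; git-annex answers with
// a single "VALUE ..." line, where an empty value means nothing is stored.
func LookupFileID(key string) (string, error) {
	fmt.Printf("GETSTATE %s\n", key)
	line, err := stdin.ReadString('\n')
	if err != nil {
		return "", err
	}
	reply := strings.TrimRight(line, "\n")
	if !strings.HasPrefix(reply, "VALUE") {
		return "", fmt.Errorf("unexpected reply: %q", reply)
	}
	return strings.TrimSpace(strings.TrimPrefix(reply, "VALUE")), nil
}
```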

That said, there are workarounds:

I think there are so many class C calls because git-annex calls checkpresentkey a LOT, even during operations that aren't obviously using this remote (for example, a local git annex drop). It does this for every remote whose trust level is semitrusted (the default), which means the remote is expected to lose data sometimes. Changing the level to trusted with git annex trust b2remotename should get rid of almost all checkpresentkey calls (the notable exception being git annex fsck --fast --from b2). It's also useful to set the annex-cost of this remote relatively high if you have other, non-paid remotes connected, so that git-annex will prefer them when possible.

Since these are helpful, non-default, and non-obvious, I'll update the readme to mention them.

encryptio added a commit that referenced this issue Dec 3, 2016
@encryptio (Owner)

@timsomers Could you try the git annex trust command and see if that helps for you? It should, but I might have missed some other way those transactions occur.

@timsomers

Hi @encryptio, my full command is already "git annex copy --to b2 --not --in b2 --trust b2", so additionally trusting that repo should not make a difference. I've changed it anyway to make sure, but we'll have to wait until tomorrow to see the result.

Can't you just cache the output of list_file_names in a temp file and refresh it, e.g., once a minute?

@timsomers

Trusting the repo changed nothing. Checking the reports page, it seems two list calls are made for each upload:

b2_list_file_names 18,191
b2_upload_file 9,095

Is this necessary? Can't we at least reduce this to a single call?

encryptio added a commit that referenced this issue Dec 10, 2016
There's a possible race condition if multiple processes upload the same file at the same time, but that race condition already existed and does not cause any data loss (instead, it causes a small waste of space in B2), and the inconsistency is corrected after removal by a git annex fsck.

I think people will naturally avoid the race condition. Fixing it would require some extra API calls to ListFileVersions during removal.

Improves #9
@encryptio (Owner)

@timsomers Added a cache to the ListFileNames call. Could you try that out?

I had to think pretty hard about making sure it's actually safe to do so, and ended up concluding that it's not significantly worse than the existing race condition (see 2bf053c for details on the race).

@timsomers

I've built this and it does indeed improve things: now I manage to upload about 2.7k files with 2.5k transactions. I wanted to try a longer cache time (I don't believe the race condition you mentioned applies to me, as I only push from a single repo), but I didn't immediately find out how to rebuild my customized code; all I know is the go get github.com/encryptio/git-annex-remote-b2 command, which pulls in your repo.

@encryptio (Owner) commented Dec 12, 2016

If you'd like to adjust and rebuild, edit the source in $GOPATH/src/github.com/encryptio/git-annex-remote-b2/main.go, then run go install from that directory; it'll place a built binary in $GOPATH/bin just like go get did.

A longer cache time wouldn't improve things, since what it caches is scoped to a single upload: git-annex calls CHECKPRESENT immediately before calling STORE, and this one-item in-memory cache reuses the result of CHECKPRESENT for the first half of the STORE operation.
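
The shape of that one-item cache is roughly the following (a sketch with hypothetical stand-in names, not the remote's exact code; see 2bf053c for the actual change):

```go
package sketch

// FileInfo is a stand-in for whatever the remote's ListFileNames wrapper
// returns per file version; listFromB2 is a stand-in for the real class C
// API call. Both are hypothetical names.
type FileInfo struct{ ID, Name string }

func listFromB2(key string) ([]FileInfo, error) { /* b2_list_file_names */ return nil, nil }

type listing struct {
	key   string
	files []FileInfo
}

// cached holds at most one listing: the one the last CHECKPRESENT fetched.
var cached *listing

// listFileVersions returns the B2 versions of a key, reusing the result of
// an immediately preceding lookup for the same key, so a CHECKPRESENT
// followed by a STORE costs one class C call instead of two.
func listFileVersions(key string) ([]FileInfo, error) {
	if cached != nil && cached.key == key {
		files := cached.files
		cached = nil // single use: later calls re-check against B2
		return files, nil
	}
	files, err := listFromB2(key)
	if err != nil {
		return nil, err
	}
	cached = &listing{key: key, files: files}
	return files, nil
}
```

Invalidating after a single reuse keeps the window for stale answers no wider than the CHECKPRESENT-then-STORE pair itself.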

Getting better than one ListFileNames call per upload is very difficult. It would roughly double the amount of data and churn in the git-annex branch (notable for people who have large git-annex repos, like me), and it would not be a backwards-compatible change, so a new versioning system would have to be put in place for the config operations. That comes with its own difficulties, like testing complications and the very unobvious "remove a remote completely and add it again, but then my data is gone" problem (because the re-added remote would use a different config version that makes different assumptions about the data format).

@meristo commented Jun 8, 2017

I've been using this and noticed the high number of class C transactions, so I modified it to cache the full bucket contents in memory, either for an entire invocation or for a user-configurable duration: meristo@7ef35c6
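
For comparison with the one-item cache sketched above, a whole-bucket cache with a user-configurable lifetime might be shaped like this (assumed names and structure, not meristo's actual diff):

```go
package sketch

import (
	"sync"
	"time"
)

type FileInfo struct{ ID, Name string } // stand-in, as in the earlier sketch

// bucketCache caches the full bucket listing. A TTL of zero means "keep it
// for the whole invocation"; a positive TTL refetches after that duration.
type bucketCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	fetched time.Time
	names   map[string][]FileInfo // file name -> all versions
}

// get returns the cached listing, refreshing it via listAll (one paged walk
// of b2_list_file_names) when the cache is empty or stale.
func (c *bucketCache) get(listAll func() (map[string][]FileInfo, error)) (map[string][]FileInfo, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	stale := c.ttl > 0 && time.Since(c.fetched) > c.ttl
	if c.names == nil || stale {
		names, err := listAll()
		if err != nil {
			return nil, err
		}
		c.names, c.fetched = names, time.Now()
	}
	return c.names, nil
}
```

This trades the narrow staleness window of the one-item cache for far fewer API calls, which is a reasonable trade when only one repo pushes to the bucket.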

@greggrossmeier

Has anyone used @meristo's patch? I'll try giving it a go in the next week and report back.
