Add --blob-exec to run system commands for each blob #169

Open
wants to merge 1 commit into
base: main

nightscape

This is a rebase of Paul Draper's implementation of blob-exec:
#83

Is there any chance of merging this? I'm currently using it to replace tabs with spaces, exactly as Paul described in his use cases.

@javabrett
Contributor

I'm interested in whether there are any metrics that show how much faster this is than running the equivalent filter-branch, if anyone has those numbers.

@OwnageIsMagic

OwnageIsMagic commented Nov 24, 2016

It doesn't work for me.
image
What am I missing?
bfg2="java -jar /home/ubuntu/bfg-repo-cleaner/bfg/target/bfg-1.12.4-SNAPSHOT-paul-blob-exec-234ba67.jar"

Exception in thread "main" java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: java.io.IOException: Stream closed
        at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
        at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
        at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
        at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320)
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
        at com.madgag.git.bfg.MemoUtil$$anonfun$concurrentCleanerMemo$1$$anon$1.apply(memo.scala:60)
        at com.madgag.git.bfg.GitUtil$$anon$1.apply(GitUtil.scala:69)
        at com.madgag.git.bfg.CleaningMapper$class.replacement(GitUtil.scala:44)
        at com.madgag.git.bfg.GitUtil$$anon$1.replacement(GitUtil.scala:68)
        at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1$$anonfun$2.apply(ProtectedObjectDirtReport.scala:44)
        at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1$$anonfun$2.apply(ProtectedObjectDirtReport.scala:44)
        at scala.util.Either.fold(Either.scala:99)
        at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1.apply(ProtectedObjectDirtReport.scala:44)
        at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$$anonfun$reportsFor$1.apply(ProtectedObjectDirtReport.scala:42)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
        at scala.collection.Iterator$class.foreach(Iterator.scala:750)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1202)
        at scala.collection.MapLike$DefaultKeySet.foreach(MapLike.scala:174)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
        at scala.collection.AbstractSet.scala$collection$SetLike$$super$map(Set.scala:47)
        at scala.collection.SetLike$class.map(SetLike.scala:92)
        at scala.collection.AbstractSet.map(Set.scala:47)
        at com.madgag.git.bfg.cleaner.protection.ProtectedObjectDirtReport$.reportsFor(ProtectedObjectDirtReport.scala:42)
        at com.madgag.git.bfg.cleaner.CLIReporter.reportProtectedCommitsAndTheirDirt(Reporter.scala:113)
        at com.madgag.git.bfg.cleaner.CLIReporter.reportObjectProtection(Reporter.scala:86)
        at com.madgag.git.bfg.cleaner.RepoRewriter$.rewrite(RepoRewriter.scala:94)
        at com.madgag.git.bfg.cli.Main$$anonfun$1.apply(Main.scala:59)
        at com.madgag.git.bfg.cli.Main$$anonfun$1.apply(Main.scala:34)
        at scala.Option.map(Option.scala:146)
        at com.madgag.git.bfg.cli.Main$.delayedEndpoint$com$madgag$git$bfg$cli$Main$1(Main.scala:33)
        at com.madgag.git.bfg.cli.Main$delayedInit$body.apply(Main.scala:27)
        at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
        at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.App$$anonfun$main$1.apply(App.scala:76)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
        at scala.App$class.main(App.scala:76)
        at com.madgag.git.bfg.cli.Main$.main(Main.scala:27)
        at com.madgag.git.bfg.cli.Main.main(Main.scala)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Stream closed
        at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
        at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
        at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
        at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137)
        at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320)
        at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
        at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
        at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
        at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941)
        at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824)
        at com.madgag.git.bfg.MemoUtil$$anonfun$concurrentCleanerMemo$1$$anon$1.apply(memo.scala:60)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
        at scala.collection.immutable.List.map(List.scala:285)
        at com.madgag.git.bfg.cleaner.TreeBlobModifier$class.apply(TreeBlobModifier.scala:38)
        at com.madgag.git.bfg.cli.CLIConfig$$anonfun$blobExecModifier$1$$anon$3.apply(CLIConfig.scala:181)
        at com.madgag.git.bfg.cli.CLIConfig$$anonfun$blobExecModifier$1$$anon$3.apply(CLIConfig.scala:181)
        at scala.Function$$anonfun$chain$1$$anonfun$apply$1.apply(Function.scala:24)
        at scala.Function$$anonfun$chain$1$$anonfun$apply$1.apply(Function.scala:24)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
        at scala.collection.immutable.List.foldLeft(List.scala:84)
        at scala.collection.TraversableOnce$class.$div$colon(TraversableOnce.scala:136)
        at scala.collection.AbstractTraversable.$div$colon(Traversable.scala:104)
        at scala.Function$$anonfun$chain$1.apply(Function.scala:24)
        at com.madgag.git.bfg.cleaner.ObjectIdCleaner$$anonfun$4.apply(ObjectIdCleaner.scala:124)
        at com.madgag.git.bfg.cleaner.ObjectIdCleaner$$anonfun$4.apply(ObjectIdCleaner.scala:118)
        at com.madgag.git.bfg.MemoUtil$$anon$3.load(memo.scala:74)
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
        ... 41 more
Caused by: java.io.IOException: Stream closed
        at java.lang.ProcessBuilder$NullOutputStream.write(ProcessBuilder.java:434)
        at java.io.OutputStream.write(OutputStream.java:116)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at com.madgag.git.bfg.cleaner.BlobExecModifier$class.fix(BlobExecModifier.scala:25)
        at com.madgag.git.bfg.cli.CLIConfig$$anonfun$blobExecModifier$1$$anon$3.fix(CLIConfig.scala:181)
        at com.madgag.git.bfg.cleaner.TreeBlobModifier$$anonfun$1.apply(TreeBlobModifier.scala:32)
        at com.madgag.git.bfg.cleaner.TreeBlobModifier$$anonfun$1.apply(TreeBlobModifier.scala:31)
        at com.madgag.git.bfg.MemoUtil$$anon$3.load(memo.scala:74)
        at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
        at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
        ... 67 more

@dmgerman

dmgerman commented May 2, 2017

> I'm interested in whether there are some metrics that show how much faster this is than running the equivalent filter-branch, if anyone had those numbers.

Assuming that all files are replaced with new versions, this is my estimation of the cost for each.

Given my experience with this branch (running on Linux) and with filter-branch, my view is that the savings depend on:

  1. the number of commits, C
  2. the number of unique files in the project, UF
  3. the number of files replaced in a given commit
  4. the number of files that exist in the file system at any given time
  5. the speed of the file system
  6. the time it takes to process each individual file

With BFG, the cost of the replacement is a function of (k1 * C + k2 * UF)/(k3*SpeedFileSystem)

With filter-branch, this is what happens:

For every commit:

  • check out the commit, replacing the files in the file system whose contents differ from the commit's versions
  • run the filter on each file (duplicates are not detected)
  • create new commit

For example, say a filter replaces the contents of every file in every revision.

For revision 1, the contents are checked out. Let's say we have n1 files.
The n1 files are replaced with new versions.
These changes are committed.
Now comes the truly expensive operation: checking out the next commit. This means replacing every single file in the tree (even those that were not part of the actual changes) with its version according to the commit. Now we have to process every single file in the repo again (they are all dirty). After they are all processed, git will realize that only the files that were in commit 2 have actually changed (with respect to the processed version of commit 1).

This process means the cost of filter-branch includes:

  • the cost of recreating every file in every revision (checking out and writing files in the repo: avg files in repo * # of commits)
  • the cost of processing every file in every revision (avg files in repo * # of commits)

So the cost of BFG is proportional to the avg number of files changed per commit multiplied by the number of commits (basically, the number of unique blobs found in the repo), while the cost of filter-branch is proportional to the avg number of files IN the repo multiplied by the number of commits.
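
The difference between the two counts can be sketched with a toy history. This is only an illustration under stated assumptions: `make_history` and `count_filter_runs` are hypothetical helpers, a commit is modeled as a plain mapping from path to blob id, and the "memoized" count assumes each distinct blob is processed exactly once, as described above. It is not BFG's or filter-branch's actual code.

```python
def make_history(num_commits, files_in_repo, changed_per_commit):
    """Build a toy history: a list of trees, each mapping path -> blob id.

    Every commit after the first gives `changed_per_commit` files a new blob id.
    """
    tree = {f"file{i}": f"blob-0-{i}" for i in range(files_in_repo)}
    history = [dict(tree)]
    for c in range(1, num_commits):
        for k in range(changed_per_commit):
            idx = (c * changed_per_commit + k) % files_in_repo
            tree[f"file{idx}"] = f"blob-{c}-{idx}"
        history.append(dict(tree))
    return history


def count_filter_runs(history):
    """Return (per-tree runs, unique-blob runs) for a toy history."""
    per_tree_runs = 0    # every file of every checked-out tree is processed
    seen_blobs = set()   # each distinct blob is processed only once
    for tree in history:
        per_tree_runs += len(tree)
        seen_blobs.update(tree.values())
    return per_tree_runs, len(seen_blobs)


history = make_history(num_commits=100, files_in_repo=50, changed_per_commit=2)
per_tree, unique = count_filter_runs(history)
print(per_tree, unique)  # 5000 248
```

With 100 commits, 50 files in the tree, and 2 files changed per commit, the per-tree strategy runs the filter 5000 times, while the unique-blob strategy runs it 248 times (50 initial blobs plus 2 new blobs for each of the 99 later commits).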

Let's say we can process 100 files per second, and we have 1M commits, with an average of 10k files in the repo and 10 files modified per commit (I am thinking of Linux here).

Let us assume that the cost of checking out the files (filter-branch) and processing the commits (BFG and filter-branch) is negligible (it is not, but bear with me). If my numbers are right, with filter-branch it would take:

(1M * 10k)/100 seconds to process this repo => 1e8 seconds => about 1157 days => a bit over 3 years

with BFG it would take:

(1M * 10)/100 seconds => 1e5 seconds => about 28 hrs.

So, in conclusion: BFG processing time is O(#commits * #avgNumberOfFilesChangedPerCommit),
while filter-branch processing time is O(#commits * #avgNumberOfFilesInRepo).
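
The arithmetic behind these two estimates can be checked with a few lines. This is a minimal sketch using only the numbers stated above (1M commits, 10k files in the repo, 10 files changed per commit, 100 files processed per second); the constant and function names are made up for illustration, and checkout and commit overhead is ignored, as in the estimate itself.

```python
COMMITS = 1_000_000
AVG_FILES_IN_REPO = 10_000        # files present in the tree at each commit
FILES_CHANGED_PER_COMMIT = 10     # files actually modified by each commit
FILES_PER_SECOND = 100            # assumed processing rate


def estimate_seconds(commits, files_per_commit, rate):
    """Total processing time if `files_per_commit` files are handled per commit."""
    return commits * files_per_commit / rate


filter_branch_s = estimate_seconds(COMMITS, AVG_FILES_IN_REPO, FILES_PER_SECOND)
bfg_s = estimate_seconds(COMMITS, FILES_CHANGED_PER_COMMIT, FILES_PER_SECOND)

print(f"filter-branch: {filter_branch_s / 86_400:.0f} days")  # 1157 days
print(f"BFG: {bfg_s / 3_600:.1f} hours")                      # 27.8 hours
print(f"ratio: {filter_branch_s / bfg_s:.0f}x")               # 1000x
```

Under these assumptions the ratio between the two is simply files in the repo divided by files changed per commit, here 10k / 10 = 1000.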

@A-red-BREASTED-robin

May I be allowed to use and learn

6 participants