Merge branch 'en/important-misc-fixes'

Several fixes, ranging from miscellaneous documentation improvements to a gnarly bug in _IDs.record_rename(), plus an important improvement in how already_ran is interpreted if left around.

Signed-off-by: Elijah Newren <[email protected]>
newren · Oct 21, 2024 · 9388cd4
2 parents db7e07d + 8a243ae

Showing 5 changed files with 255 additions and 24 deletions.
diff --git a/Documentation/git-filter-repo.txt b/Documentation/git-filter-repo.txt
@@ -54,6 +54,9 @@ can be overridden, but they are all on by default):
   * pruning commits which become empty due to the above filters (also
     handles edge cases like pruning of merge commits which become
     degenerate and empty)
+  * rewriting stashes
+  * baking the changes made by refs/replace/ refs into the permanent
+    history and removing the replace refs
   * stripping of original history to avoid mixing old and new history
   * repacking the repository post-rewrite to shrink the repo for the
     user
@@ -379,29 +382,56 @@ directory. These files are overwritten unconditionally on every run.
 Commit map
 ~~~~~~~~~~
 
-The `.git/filter-repo/commit-map` file contains a mapping of how all
+The `$GIT_DIR/filter-repo/commit-map` file contains a mapping of how all
 commits were (or were not) changed.
 
   * A header is the first line with the text "old" and "new"
   * Commit mappings are in no particular order
   * All commits in range of the rewrite will be listed, even commits
-    that are unchanged (e.g. because the commit pre-dated when the
-    large file(s) were introduced to the repo).
+    that are unchanged (e.g. because the commit pre-dated when the files
+    that the filtering operation is removing were introduced to the repo).
   * An all-zeros hash, or null SHA, represents a non-existent object.
     When in the "new" column, this means the commit was removed
     entirely.
 
 Reference map
 ~~~~~~~~~~~~~
 
-The `.git/filter-repo/ref-map` file contains a mapping of which local
+The `$GIT_DIR/filter-repo/ref-map` file contains a mapping of which local
 references were changed.
 
   * A header is the first line with the text "old", "new" and "ref"
   * Reference mappings are in no particular order
   * An all-zeros hash, or null SHA, represents a non-existent object.
     When in the "new" column, this means the ref was removed entirely.
 
+Already Ran
+~~~~~~~~~~~
+
+The `$GIT_DIR/filter-repo/already_ran` file records that
+git-filter-repo has been run.  When this file is present, future runs will
+be treated as an extension of the previous filtering operation.
+
+Concretely, this means:
+  * The "Fresh Clone" check is bypassed
+
+    This is done because past runs would cause the repository to no longer
+    look like a fresh clone, and thus fail the fresh clone check, but doing
+    filtering via multiple invocations of git-filter-repo is an intended
+    and supported use case.  You already passed or bypassed the "Fresh Clone"
+    check on your initial run.
+
+However, if the already_ran file exists but is more than 1 day old when the
+user invokes git-filter-repo, they will be prompted for whether the new run
+should be considered a continuation of the previous run.  If they do not
+answer in the affirmative, then the above bullet will not apply.
+This prompt exists because users might do a history rewrite in a repository,
+forget about it and leave the $GIT_DIR/filter-repo directory around, and
+then some months or years later need to do another rewrite.  If commits
+have been made public and shared from the previous rewrite, then the next
+filter-repo run should not be considered a continuation of the previous
+filtering run.
+
 [[FRESHCLONE]]
 FRESH CLONE SAFETY CHECK AND --FORCE
 ------------------------------------

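The age check described in the "Already Ran" section can be sketched on its own, outside git-filter-repo. This is a minimal illustration, not the project's implementation: the helper names `marker_is_stale` and `treat_as_continuation` are hypothetical, and the `answer` argument stands in for the interactive prompt so the logic can be exercised without a terminal. The one-day threshold (86400 seconds) and the "anything other than y means no" rule are taken from the diff.

```python
import os
import time

ONE_DAY_SECONDS = 86400  # same threshold the diff uses for "older than a day"


def marker_is_stale(path, now=None, max_age=ONE_DAY_SECONDS):
    """Return True if the marker file exists and is older than max_age seconds."""
    if not os.path.isfile(path):
        return False
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) > max_age


def treat_as_continuation(path, answer, now=None):
    """Decide whether this run continues the previous filtering run.

    `answer` is a stand-in for the user's reply to the prompt (hypothetical
    parameter); it is only consulted when the marker file is stale.
    """
    if not os.path.isfile(path):
        return False              # no previous run recorded
    if marker_is_stale(path, now=now):
        if answer.lower() != "y":
            os.remove(path)       # start from a clean slate
            return False
    return True
```

With a marker file touched a few minutes ago, `treat_as_continuation(path, "n")` still returns True because the prompt is never reached; only a marker older than a day makes the answer matter.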
diff --git a/git-filter-repo b/git-filter-repo
@@ -354,17 +354,24 @@ class ProgressWriter(object):
 class _IDs(object):
   """
   A class that maintains the 'name domain' of all the 'marks' (short int
-  id for a blob/commit git object). The reason this mechanism is necessary
-  is because the text of fast-export may refer to an object using a different
-  mark than the mark that was assigned to that object using IDS.new(). This
-  class allows you to translate the fast-export marks (old) to the marks
-  assigned from IDS.new() (new).
-
-  Note that there are two reasons why the marks may differ: (1) The
-  user manually creates Blob or Commit objects (for insertion into the
-  stream) (2) We're reading the data from two different repositories
-  and trying to combine the data (git fast-export will number ids from
-  1...n, and having two 1's, two 2's, two 3's, causes issues).
+  id for a blob/commit git object). There are two reasons this mechanism
+  is necessary:
+    (1) the output text of fast-export may refer to an object using a different
+        mark than the mark that was assigned to that object using IDS.new().
+        (This class allows you to translate the fast-export marks, "old" to
+         the marks assigned from IDS.new(), "new").
+    (2) when we prune a commit, its "old" id becomes invalid.  Any commits
+        which had that commit as a parent need to use the nearest unpruned
+        ancestor as their parent instead.
+
+  Note that for purpose (1) above, this typically comes about because the user
+  manually creates Blob or Commit objects (for insertion into the stream).
+  It could also come about if we attempt to read the data from two different
+  repositories and try to combine the data (git fast-export will number ids
+  from 1...n, and having two 1's, two 2's, two 3's, causes issues; granted,
+  this scheme doesn't handle the two streams perfectly either, but if the first
+  fast export stream is entirely processed and handled before the second stream
+  is started, this mechanism may be sufficient to handle it).
   """
 
   def __init__(self):
@@ -399,7 +406,7 @@ class _IDs(object):
     """
     Record that old_id is being renamed to new_id.
     """
-    if old_id != new_id:
+    if old_id != new_id or old_id in self._translation:
       # old_id -> new_id
       self._translation[old_id] = new_id
 
@@ -434,7 +441,12 @@ class _IDs(object):
       rv += "  %d -> %s\n" % (k, self._translation[k])
 
     rv += "Reverse translation:\n"
-    for k in sorted(self._reverse_translation):
+    reverse_keys = list(self._reverse_translation.keys())
+    if None in reverse_keys: # pragma: no cover
+      reverse_keys.remove(None)
+      reverse_keys = sorted(reverse_keys)
+      reverse_keys.append(None)
+    for k in reverse_keys:
       rv += "  " + str(k) + " -> " + str(self._reverse_translation[k]) + "\n"
 
     return rv
@@ -2935,7 +2947,21 @@ class RepoFilter(object):
 
     # Determine if this is second or later run of filter-repo
     tmp_dir = self.results_tmp_dir(create_if_missing=False)
-    already_ran = os.path.isfile(os.path.join(tmp_dir, b'already_ran'))
+    ran_path = os.path.join(tmp_dir, b'already_ran')
+    already_ran = os.path.isfile(ran_path)
+    if already_ran:
+      current_time = time.time()
+      file_mod_time = os.path.getmtime(ran_path)
+      file_age = current_time - file_mod_time
+      if file_age > 86400: # file older than a day
+        msg = (f"The previous run is older than a day ({decode(ran_path)} already exists).\n"
+               f"See \"Already Ran\" section in the manual for more information.\n"
+               f"Treat this run as a continuation of filtering in the previous run (Y/N)? ")
+        response = input(msg)
+
+        if response.lower() != 'y':
+          os.remove(ran_path)
+          already_ran = False
 
     # Default for --replace-refs
     if not self._args.replace_refs:
@@ -3146,10 +3172,10 @@ class RepoFilter(object):
       return
     fi_input, fi_output = self._import_pipes
     while self._pending_renames:
-      orig_id, ignore = self._pending_renames.popitem(last=False)
-      new_id = fi_output.readline().rstrip()
-      self._commit_renames[orig_id] = new_id
-      if old_hash == orig_id:
+      orig_hash, ignore = self._pending_renames.popitem(last=False)
+      new_hash = fi_output.readline().rstrip()
+      self._commit_renames[orig_hash] = new_hash
+      if old_hash == orig_hash:
         return
       if limit and len(self._pending_renames) < limit:
         return
@@ -3914,7 +3940,7 @@ class RepoFilter(object):
       refs_to_nuke = set()
     if refs_to_nuke and self._args.debug:
       print("[DEBUG] Deleting the following refs:\n  "+
-            decode(b"\n  ".join(refs_to_nuke)))
+            decode(b"\n  ".join(sorted(refs_to_nuke))))
     p.stdin.write(b''.join([b"delete %s\n" % x
                            for x in refs_to_nuke]))
 

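One way to read the one-line change to `record_rename()` in the hunk above: with the old condition, a rename of an id to itself was silently ignored even when an earlier, now outdated, translation for that id was still stored. The sketch below is a simplified, hypothetical stand-in (`MarkTable`, `demo`, and the `overwrite_existing` switch are not in git-filter-repo, and the reverse-translation bookkeeping is omitted). The sequence of renames is made up for illustration; the commit does not say which sequence triggered the bug.

```python
class MarkTable:
    """Simplified, hypothetical stand-in for the translation half of _IDs."""

    def __init__(self, overwrite_existing):
        self._translation = {}
        # False mimics the old condition, True mimics the new one.
        self._overwrite_existing = overwrite_existing

    def record_rename(self, old_id, new_id):
        changed = old_id != new_id
        has_entry = self._overwrite_existing and old_id in self._translation
        if changed or has_entry:
            self._translation[old_id] = new_id

    def translate(self, old_id):
        # Assumption: ids without a recorded rename translate to themselves.
        return self._translation.get(old_id, old_id)


def demo(overwrite_existing):
    table = MarkTable(overwrite_existing)
    table.record_rename(5, 3)  # id 5 is first redirected to id 3
    table.record_rename(5, 5)  # later, id 5 should map to itself again
    return table.translate(5)
```

Under these assumptions, `demo(False)` keeps returning the outdated target while `demo(True)` returns the id itself, which is the behavior the added `or old_id in self._translation` clause makes possible.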
diff --git a/pyproject.toml b/pyproject.toml
@@ -6,7 +6,7 @@ authors = [
 ]
 readme = "README.md"
 classifiers = [
-    "Development Status :: 4 - Beta",
+    "Development Status :: 5 - Production/Stable",
     "Operating System :: OS Independent",
     "Programming Language :: Python",
     "License :: OSI Approved :: MIT License",

diff --git a/t/t9393-rerun.sh b/t/t9393-rerun.sh
@@ -0,0 +1,79 @@
+#!/bin/bash
+
+test_description='filter-repo tests with reruns'
+
+. ./test-lib.sh
+
+export PATH=$(dirname $TEST_DIRECTORY):$PATH  # Put git-filter-repo in PATH
+
+DATA="$TEST_DIRECTORY/t9393"
+DELETED_SHA="0000000000000000000000000000000000000000" # FIXME: sha256 support
+
+test_expect_success 'a re-run that is treated as a clean slate' '
+	test_create_repo clean_slate_rerun &&
+	(
+		cd clean_slate_rerun &&
+		git fast-import --quiet <$DATA/simple &&
+
+		FIRST_ORPHAN=$(git rev-parse orphan-me~1) &&
+		FINAL_ORPHAN=$(git rev-parse orphan-me) &&
+		FILE_A_CHANGE=$(git rev-list -1 HEAD -- fileA) &&
+		FILE_B_CHANGE=$(git rev-list -1 HEAD -- fileB) &&
+		FILE_C_CHANGE=$(git rev-list -1 HEAD -- fileC) &&
+		FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) &&
+		ORIGINAL_TAG=$(git rev-parse v1.0) &&
+
+		git filter-repo --invert-paths --path fileB --force &&
+		NEW_FILE_C_CHANGE=$(git rev-list -1 HEAD -- fileC) &&
+		NEW_FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) &&
+		FINAL_TAG=$(git rev-parse v1.0) &&
+
+		cat <<-EOF | sort >sha-expect &&
+		${FIRST_ORPHAN} ${FIRST_ORPHAN}
+		${FINAL_ORPHAN} ${FINAL_ORPHAN}
+		${FILE_A_CHANGE} ${FILE_A_CHANGE}
+		${FILE_B_CHANGE} ${DELETED_SHA}
+		${FILE_C_CHANGE} ${NEW_FILE_C_CHANGE}
+		${FILE_D_CHANGE} ${NEW_FILE_D_CHANGE}
+		EOF
+		printf "%-40s %s\n" old new >expect &&
+		cat sha-expect >>expect &&
+		test_cmp <(sort expect) <(sort .git/filter-repo/commit-map) &&
+
+		cat <<-EOF | sort -k 3 >sha-expect &&
+		${FILE_D_CHANGE} ${NEW_FILE_D_CHANGE} $(git symbolic-ref HEAD)
+		${FINAL_ORPHAN} ${FINAL_ORPHAN} refs/heads/orphan-me
+		${ORIGINAL_TAG} ${FINAL_TAG} refs/tags/v1.0
+		EOF
+		printf "%-40s %-40s %s\n" old new ref >expect &&
+		cat sha-expect >>expect &&
+		test_cmp expect .git/filter-repo/ref-map &&
+
+		touch -t 197001010000 .git/filter-repo/already_ran &&
+		echo no | git filter-repo --invert-paths --path fileC --force &&
+		FINAL_FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) &&
+		REALLY_FINAL_TAG=$(git rev-parse v1.0) &&
+
+		cat <<-EOF | sort >sha-expect &&
+		${FIRST_ORPHAN} ${FIRST_ORPHAN}
+		${FINAL_ORPHAN} ${FINAL_ORPHAN}
+		${FILE_A_CHANGE} ${FILE_A_CHANGE}
+		${NEW_FILE_C_CHANGE} ${DELETED_SHA}
+		${NEW_FILE_D_CHANGE} ${FINAL_FILE_D_CHANGE}
+		EOF
+		printf "%-40s %s\n" old new >expect &&
+		cat sha-expect >>expect &&
+		test_cmp <(sort expect) <(sort .git/filter-repo/commit-map) &&
+
+		cat <<-EOF | sort -k 3 >sha-expect &&
+		${NEW_FILE_D_CHANGE} ${FINAL_FILE_D_CHANGE} $(git symbolic-ref HEAD)
+		${FINAL_ORPHAN} ${FINAL_ORPHAN} refs/heads/orphan-me
+		${FINAL_TAG} ${REALLY_FINAL_TAG} refs/tags/v1.0
+		EOF
+		printf "%-40s %-40s %s\n" old new ref >expect &&
+		cat sha-expect >>expect &&
+		test_cmp expect .git/filter-repo/ref-map
+	)
+'
+
+test_done
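The test above compares against `commit-map`, whose layout (a header line with "old" and "new", then one mapping per line, with an all-zeros hash marking a removed commit) is described in the documentation hunk. A minimal sketch of reading that layout follows; `parse_commit_map` and `classify` are hypothetical helper names, columns are assumed to be whitespace-separated, and the 40-character null hash assumes a SHA-1 repository.

```python
NULL_SHA = "0" * 40  # all-zeros hash; assumes SHA-1 (sha256 would need 64 zeros)


def parse_commit_map(text):
    """Parse commit-map contents into a dict of old hash -> new hash."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or lines[0].split() != ["old", "new"]:
        raise ValueError("expected a header line containing 'old' and 'new'")
    mapping = {}
    for line in lines[1:]:
        old, new = line.split()
        mapping[old] = new
    return mapping


def classify(mapping):
    """Split old hashes into (removed, unchanged, rewritten), each sorted."""
    removed = sorted(o for o, n in mapping.items() if n == NULL_SHA)
    unchanged = sorted(o for o, n in mapping.items() if n == o)
    rewritten = sorted(o for o, n in mapping.items() if n not in (o, NULL_SHA))
    return removed, unchanged, rewritten
```

Because the documentation states that mappings appear in no particular order, the sketch sorts its results rather than relying on file order, much as the test sorts both sides before comparing.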
diff --git a/t/t9393/simple b/t/t9393/simple
@@ -0,0 +1,96 @@
+feature done
+# Simple repo with a few files, and two branches with no common history.
+# Note that the original-oid directives are very fake, but make it easy to
+# track things.
+blob
+mark :1
+original-oid 0000000000000000000000000000000000000001
+data 16
+file 1 contents
+
+blob
+mark :2
+original-oid 0000000000000000000000000000000000000002
+data 16
+file 2 contents
+
+blob
+mark :3
+original-oid 0000000000000000000000000000000000000003
+data 16
+file 3 contents
+
+blob
+mark :4
+original-oid 0000000000000000000000000000000000000004
+data 16
+file 4 contents
+
+reset refs/heads/orphan-me
+commit refs/heads/orphan-me
+mark :5
+original-oid 0000000000000000000000000000000000000009
+author Little O. Me <[email protected]> 1535228562 -0700
+committer Little O. Me <[email protected]> 1535228562 -0700
+data 8
+Initial
+M 100644 :1 nuke-me
+
+commit refs/heads/orphan-me
+mark :6
+original-oid 000000000000000000000000000000000000000A
+author Little 'ol Me <me@laptop.(none)> 1535229544 -0700
+committer Little 'ol Me <me@laptop.(none)> 1535229544 -0700
+data 9
+Tweak it
+from :5
+M 100644 :4 nuke-me
+
+reset refs/heads/master
+commit refs/heads/master
+mark :7
+original-oid 000000000000000000000000000000000000000B
+author Little O. Me <[email protected]> 1535229523 -0700
+committer Little O. Me <[email protected]> 1535229523 -0700
+data 15
+Initial commit
+M 100644 :1 fileA
+
+commit refs/heads/master
+mark :8
+original-oid 000000000000000000000000000000000000000C
+author Lit.e Me <[email protected]> 1535229559 -0700
+committer Lit.e Me <[email protected]> 1535229580 -0700
+data 10
+Add fileB
+from :7
+M 100644 :2 fileB
+
+commit refs/heads/master
+mark :9
+original-oid 000000000000000000000000000000000000000D
+author Little Me <[email protected]> 1535229601 -0700
+committer Little Me <[email protected]> 1535229601 -0700
+data 10
+Add fileC
+from :8
+M 100644 :3 fileC
+
+commit refs/heads/master
+mark :10
+original-oid 000000000000000000000000000000000000000E
+author Little Me <[email protected]> 1535229618 -0700
+committer Little Me <[email protected]> 1535229618 -0700
+data 10
+Add fileD
+from :9
+M 100644 :4 fileD
+
+tag v1.0
+from :10
+original-oid 000000000000000000000000000000000000000F
+tagger Little John <[email protected]> 1535229637 -0700
+data 5
+v1.0
+
+done