Merge branch 'en/important-misc-fixes'
Several fixes, from miscellaneous documentation improvements, up to a
gnarly bug in _IDs.record_rename(), and an important improvement in how
already_ran is interpreted if left around.

Signed-off-by: Elijah Newren <[email protected]>
newren committed Oct 21, 2024
2 parents db7e07d + 8a243ae commit 9388cd4
Showing 5 changed files with 255 additions and 24 deletions.
38 changes: 34 additions & 4 deletions Documentation/git-filter-repo.txt
@@ -54,6 +54,9 @@ can be overridden, but they are all on by default):
* pruning commits which become empty due to the above filters (also
handles edge cases like pruning of merge commits which become
degenerate and empty)
* rewriting stashes
* baking the changes made by refs/replace/ refs into the permanent
history and removing the replace refs
* stripping of original history to avoid mixing old and new history
* repacking the repository post-rewrite to shrink the repo for the
user
@@ -379,29 +382,56 @@ directory. These files are overwritten unconditionally on every run.
Commit map
~~~~~~~~~~

The `.git/filter-repo/commit-map` file contains a mapping of how all
The `$GIT_DIR/filter-repo/commit-map` file contains a mapping of how all
commits were (or were not) changed.

* A header is the first line with the text "old" and "new"
* Commit mappings are in no particular order
* All commits in range of the rewrite will be listed, even commits
that are unchanged (e.g. because the commit pre-dated when the
large file(s) were introduced to the repo).
that are unchanged (e.g. because the commit pre-dated when the files
the filtering operation is removing were introduced to the repo).
* An all-zeros hash, or null SHA, represents a non-existent object.
When in the "new" column, this means the commit was removed
entirely.
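
The format described in the bullets above can be read with a few lines of Python. The following is a minimal sketch and not part of git-filter-repo: the function name `parse_commit_map` and the sample hashes are hypothetical, and it assumes a single header line followed by whitespace-separated old/new hashes in a SHA-1 repository.

```python
NULL_SHA = "0" * 40  # all-zeros hash: the commit was removed (SHA-1 assumed)

def parse_commit_map(text):
    """Return (mapping, removed) parsed from commit-map contents.

    mapping: dict of old hash -> new hash for every listed commit
    removed: set of old hashes whose new hash is the null SHA
    """
    mapping = {}
    for line in text.splitlines()[1:]:   # skip the "old new" header
        if not line.strip():
            continue
        old, new = line.split()
        mapping[old] = new
    removed = {old for old, new in mapping.items() if new == NULL_SHA}
    return mapping, removed

# Hypothetical sample data standing in for a real commit-map file.
sample = "\n".join([
    "%-40s %s" % ("old", "new"),
    "a" * 40 + " " + "a" * 40,           # unchanged commit
    "b" * 40 + " " + "c" * 40,           # rewritten commit
    "d" * 40 + " " + NULL_SHA,           # pruned commit
])
commit_map, pruned = parse_commit_map(sample)
```

Because the mappings are in no particular order, the sketch stores them in a dict rather than relying on line position.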

Reference map
~~~~~~~~~~~~~

The `.git/filter-repo/ref-map` file contains a mapping of which local
The `$GIT_DIR/filter-repo/ref-map` file contains a mapping of which local
references were changed.

* A header is the first line with the text "old", "new" and "ref"
* Reference mappings are in no particular order
* An all-zeros hash, or null SHA, represents a non-existent object.
When in the "new" column, this means the ref was removed entirely.
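
As an illustration of the three-column layout described above, the sketch below splits ref-map contents into changed and deleted references. It is not taken from git-filter-repo: `parse_ref_map`, `classify_refs`, and the sample data are hypothetical, and a header line plus whitespace-separated columns are assumed.

```python
NULL_SHA = "0" * 40  # all-zeros hash: the ref was removed (SHA-1 assumed)

def parse_ref_map(text):
    """Return a dict of ref name -> (old hash, new hash)."""
    refs = {}
    for line in text.splitlines()[1:]:   # skip the "old new ref" header
        if not line.strip():
            continue
        old, new, ref = line.split(None, 2)
        refs[ref.strip()] = (old, new)
    return refs

def classify_refs(refs):
    """Split refs into those deleted and those pointing somewhere new."""
    deleted = sorted(r for r, (old, new) in refs.items() if new == NULL_SHA)
    changed = sorted(r for r, (old, new) in refs.items()
                     if new != NULL_SHA and old != new)
    return deleted, changed

# Hypothetical sample data standing in for a real ref-map file.
sample = "\n".join([
    "%-40s %-40s %s" % ("old", "new", "ref"),
    "a" * 40 + " " + "b" * 40 + " refs/heads/master",
    "c" * 40 + " " + "c" * 40 + " refs/heads/orphan-me",
    "d" * 40 + " " + NULL_SHA + " refs/tags/old-tag",
])
deleted_refs, changed_refs = classify_refs(parse_ref_map(sample))
```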

Already Ran
~~~~~~~~~~~

The `$GIT_DIR/filter-repo/already_ran` file records that
git-filter-repo has been run. When this file is present, future runs will
be treated as an extension of the previous filtering operation.

Concretely, this means:
* The "Fresh Clone" check is bypassed

This is done because past runs would cause the repository to no longer
look like a fresh clone, and thus fail the fresh clone check, but doing
filtering via multiple invocations of git-filter-repo is an intended
and supported use case. You already passed or bypassed the "Fresh Clone"
check on your initial run.

However, if the already_ran file exists but is more than 1 day old when
git-filter-repo is invoked, the user will be prompted for whether the new
run should be considered a continuation of the previous run. If they do not
answer in the affirmative, then the above bullet will not apply.
This prompt exists because users might do a history rewrite in a repository,
forget about it and leave the $GIT_DIR/filter-repo directory around, and
then some months or years later need to do another rewrite. If commits
have been made public and shared from the previous rewrite, then the next
filter-repo run should not be considered a continuation of the previous
filtering run.
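
The one-day rule described above comes down to comparing the marker file's modification time with the current time. The following is a small standalone sketch of that comparison, assuming the same 86400-second threshold; `marker_is_stale` is a hypothetical helper, and a throwaway temporary file stands in for the real `already_ran` file.

```python
import os
import tempfile
import time

ONE_DAY = 86400  # seconds

def marker_is_stale(path, now=None, max_age=ONE_DAY):
    """Return True if path exists and was modified more than max_age seconds ago."""
    if not os.path.isfile(path):
        return False
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path) > max_age

# Demonstration with a temporary file standing in for already_ran.
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "already_ran")
    open(marker, "w").close()
    fresh = marker_is_stale(marker)      # just created, so not stale
    os.utime(marker, (0, 0))             # backdate to the Unix epoch
    stale = marker_is_stale(marker)      # now far older than a day
    missing = marker_is_stale(os.path.join(tmp, "no-such-file"))
```

Backdating the file with `os.utime` plays the same role as the `touch -t` call in the test script: it makes the marker look old without waiting a day.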

[[FRESHCLONE]]
FRESH CLONE SAFETY CHECK AND --FORCE
------------------------------------
64 changes: 45 additions & 19 deletions git-filter-repo
@@ -354,17 +354,24 @@ class ProgressWriter(object):
class _IDs(object):
"""
A class that maintains the 'name domain' of all the 'marks' (short int
id for a blob/commit git object). The reason this mechanism is necessary
is because the text of fast-export may refer to an object using a different
mark than the mark that was assigned to that object using IDS.new(). This
class allows you to translate the fast-export marks (old) to the marks
assigned from IDS.new() (new).
Note that there are two reasons why the marks may differ: (1) The
user manually creates Blob or Commit objects (for insertion into the
stream) (2) We're reading the data from two different repositories
and trying to combine the data (git fast-export will number ids from
1...n, and having two 1's, two 2's, two 3's, causes issues).
id for a blob/commit git object). There are two reasons this mechanism
is necessary:
(1) the output text of fast-export may refer to an object using a different
mark than the mark that was assigned to that object using IDS.new().
(This class allows you to translate the fast-export marks, "old" to
the marks assigned from IDS.new(), "new").
(2) when we prune a commit, its "old" id becomes invalid. Any commit
which had that commit as a parent needs to use the nearest unpruned
ancestor as its parent instead.
Note that for purpose (1) above, this typically comes about because the user
manually creates Blob or Commit objects (for insertion into the stream).
It could also come about if we attempt to read the data from two different
repositories and try to combine the data (git fast-export will number ids
from 1...n, and having two 1's, two 2's, two 3's, causes issues; granted,
this scheme doesn't handle the two streams perfectly either, but if the first
fast export stream is entirely processed and handled before the second stream
is started, this mechanism may be sufficient to handle it).
"""

def __init__(self):
@@ -399,7 +406,7 @@ class _IDs(object):
"""
Record that old_id is being renamed to new_id.
"""
if old_id != new_id:
if old_id != new_id or old_id in self._translation:
# old_id -> new_id
self._translation[old_id] = new_id
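
One way to see what the added `or old_id in self._translation` clause changes is with a stripped-down stand-in for the translation table. This is an illustrative sketch only: `TinyIds` and its `fixed` flag are hypothetical, the real `_IDs` class tracks more state, and the scenario (an id renamed away and later renamed back to itself) is an assumption about the kind of case the fix targets.

```python
class TinyIds:
    """Hypothetical, simplified stand-in for the translation table in _IDs."""

    def __init__(self, fixed):
        self._translation = {}
        self._fixed = fixed  # True: condition after this commit; False: before

    def record_rename(self, old_id, new_id):
        # With the fix, an existing entry is overwritten even when
        # old_id == new_id, so a stale mapping cannot linger.
        has_entry = self._fixed and old_id in self._translation
        if old_id != new_id or has_entry:
            self._translation[old_id] = new_id

    def translate(self, old_id):
        return self._translation.get(old_id, old_id)

def rename_away_and_back(ids):
    ids.record_rename(5, 3)   # id 5 now maps to 3
    ids.record_rename(5, 5)   # later, id 5 should map to itself again
    return ids.translate(5)

before = rename_away_and_back(TinyIds(fixed=False))  # stale mapping survives
after = rename_away_and_back(TinyIds(fixed=True))    # mapping is overwritten
```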

@@ -434,7 +441,12 @@ class _IDs(object):
rv += " %d -> %s\n" % (k, self._translation[k])

rv += "Reverse translation:\n"
for k in sorted(self._reverse_translation):
reverse_keys = list(self._reverse_translation.keys())
if None in reverse_keys: # pragma: no cover
reverse_keys.remove(None)
reverse_keys = sorted(reverse_keys)
reverse_keys.append(None)
for k in reverse_keys:
rv += " " + str(k) + " -> " + str(self._reverse_translation[k]) + "\n"

return rv
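
The special handling of `None` in the hunk above is needed because Python 3 refuses to order `None` against integers, so calling `sorted()` on a mix of the two raises `TypeError`. The sketch below shows the failure and one way around it; `sorted_with_none_last` is a hypothetical helper, and unlike the hunk it sorts whether or not `None` is present.

```python
def sorted_with_none_last(keys):
    """Sort comparable keys, placing a None key (if any) at the end."""
    keys = list(keys)
    has_none = None in keys
    if has_none:
        keys.remove(None)
    keys = sorted(keys)
    if has_none:
        keys.append(None)
    return keys

# Plain sorted() fails on mixed int/None keys in Python 3.
try:
    sorted([2, None, 1])
    plain_sort_failed = False
except TypeError:
    plain_sort_failed = True

ordered = sorted_with_none_last([2, None, 1])
```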
@@ -2935,7 +2947,21 @@ class RepoFilter(object):

# Determine if this is second or later run of filter-repo
tmp_dir = self.results_tmp_dir(create_if_missing=False)
already_ran = os.path.isfile(os.path.join(tmp_dir, b'already_ran'))
ran_path = os.path.join(tmp_dir, b'already_ran')
already_ran = os.path.isfile(ran_path)
if already_ran:
current_time = time.time()
file_mod_time = os.path.getmtime(ran_path)
file_age = current_time - file_mod_time
if file_age > 86400: # file older than a day
msg = (f"The previous run is older than a day ({decode(ran_path)} already exists).\n"
f"See \"Already Ran\" section in the manual for more information.\n"
f"Treat this run as a continuation of filtering in the previous run (Y/N)? ")
response = input(msg)

if response.lower() != 'y':
os.remove(ran_path)
already_ran = False

# Default for --replace-refs
if not self._args.replace_refs:
@@ -3146,10 +3172,10 @@ class RepoFilter(object):
return
fi_input, fi_output = self._import_pipes
while self._pending_renames:
orig_id, ignore = self._pending_renames.popitem(last=False)
new_id = fi_output.readline().rstrip()
self._commit_renames[orig_id] = new_id
if old_hash == orig_id:
orig_hash, ignore = self._pending_renames.popitem(last=False)
new_hash = fi_output.readline().rstrip()
self._commit_renames[orig_hash] = new_hash
if old_hash == orig_hash:
return
if limit and len(self._pending_renames) < limit:
return
@@ -3914,7 +3940,7 @@ class RepoFilter(object):
refs_to_nuke = set()
if refs_to_nuke and self._args.debug:
print("[DEBUG] Deleting the following refs:\n "+
decode(b"\n ".join(refs_to_nuke)))
decode(b"\n ".join(sorted(refs_to_nuke))))
p.stdin.write(b''.join([b"delete %s\n" % x
for x in refs_to_nuke]))

2 changes: 1 addition & 1 deletion pyproject.toml
@@ -6,7 +6,7 @@ authors = [
]
readme = "README.md"
classifiers = [
"Development Status :: 4 - Beta",
"Development Status :: 5 - Production/Stable",
"Operating System :: OS Independent",
"Programming Language :: Python",
"License :: OSI Approved :: MIT License",
79 changes: 79 additions & 0 deletions t/t9393-rerun.sh
@@ -0,0 +1,79 @@
#!/bin/bash

test_description='filter-repo tests with reruns'

. ./test-lib.sh

export PATH=$(dirname $TEST_DIRECTORY):$PATH # Put git-filter-repo in PATH

DATA="$TEST_DIRECTORY/t9393"
DELETED_SHA="0000000000000000000000000000000000000000" # FIXME: sha256 support

test_expect_success 'a re-run that is treated as a clean slate' '
test_create_repo clean_slate_rerun &&
(
cd clean_slate_rerun &&
git fast-import --quiet <$DATA/simple &&
FIRST_ORPHAN=$(git rev-parse orphan-me~1) &&
FINAL_ORPHAN=$(git rev-parse orphan-me) &&
FILE_A_CHANGE=$(git rev-list -1 HEAD -- fileA) &&
FILE_B_CHANGE=$(git rev-list -1 HEAD -- fileB) &&
FILE_C_CHANGE=$(git rev-list -1 HEAD -- fileC) &&
FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) &&
ORIGINAL_TAG=$(git rev-parse v1.0) &&
git filter-repo --invert-paths --path fileB --force &&
NEW_FILE_C_CHANGE=$(git rev-list -1 HEAD -- fileC) &&
NEW_FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) &&
FINAL_TAG=$(git rev-parse v1.0) &&
cat <<-EOF | sort >sha-expect &&
${FIRST_ORPHAN} ${FIRST_ORPHAN}
${FINAL_ORPHAN} ${FINAL_ORPHAN}
${FILE_A_CHANGE} ${FILE_A_CHANGE}
${FILE_B_CHANGE} ${DELETED_SHA}
${FILE_C_CHANGE} ${NEW_FILE_C_CHANGE}
${FILE_D_CHANGE} ${NEW_FILE_D_CHANGE}
EOF
printf "%-40s %s\n" old new >expect &&
cat sha-expect >>expect &&
test_cmp <(sort expect) <(sort .git/filter-repo/commit-map) &&
cat <<-EOF | sort -k 3 >sha-expect &&
${FILE_D_CHANGE} ${NEW_FILE_D_CHANGE} $(git symbolic-ref HEAD)
${FINAL_ORPHAN} ${FINAL_ORPHAN} refs/heads/orphan-me
${ORIGINAL_TAG} ${FINAL_TAG} refs/tags/v1.0
EOF
printf "%-40s %-40s %s\n" old new ref >expect &&
cat sha-expect >>expect &&
test_cmp expect .git/filter-repo/ref-map &&
touch -t 197001010000 .git/filter-repo/already_ran &&
echo no | git filter-repo --invert-paths --path fileC --force &&
FINAL_FILE_D_CHANGE=$(git rev-list -1 HEAD -- fileD) &&
REALLY_FINAL_TAG=$(git rev-parse v1.0) &&
cat <<-EOF | sort >sha-expect &&
${FIRST_ORPHAN} ${FIRST_ORPHAN}
${FINAL_ORPHAN} ${FINAL_ORPHAN}
${FILE_A_CHANGE} ${FILE_A_CHANGE}
${NEW_FILE_C_CHANGE} ${DELETED_SHA}
${NEW_FILE_D_CHANGE} ${FINAL_FILE_D_CHANGE}
EOF
printf "%-40s %s\n" old new >expect &&
cat sha-expect >>expect &&
test_cmp <(sort expect) <(sort .git/filter-repo/commit-map) &&
cat <<-EOF | sort -k 3 >sha-expect &&
${NEW_FILE_D_CHANGE} ${FINAL_FILE_D_CHANGE} $(git symbolic-ref HEAD)
${FINAL_ORPHAN} ${FINAL_ORPHAN} refs/heads/orphan-me
${FINAL_TAG} ${REALLY_FINAL_TAG} refs/tags/v1.0
EOF
printf "%-40s %-40s %s\n" old new ref >expect &&
cat sha-expect >>expect &&
test_cmp expect .git/filter-repo/ref-map
)
'

test_done
96 changes: 96 additions & 0 deletions t/t9393/simple
@@ -0,0 +1,96 @@
feature done
# Simple repo with a few files, and two branches with no common history.
# Note that the original-oid directives are very fake, but make it easy to
# track things.
blob
mark :1
original-oid 0000000000000000000000000000000000000001
data 16
file 1 contents

blob
mark :2
original-oid 0000000000000000000000000000000000000002
data 16
file 2 contents

blob
mark :3
original-oid 0000000000000000000000000000000000000003
data 16
file 3 contents

blob
mark :4
original-oid 0000000000000000000000000000000000000004
data 16
file 4 contents

reset refs/heads/orphan-me
commit refs/heads/orphan-me
mark :5
original-oid 0000000000000000000000000000000000000009
author Little O. Me <[email protected]> 1535228562 -0700
committer Little O. Me <[email protected]> 1535228562 -0700
data 8
Initial
M 100644 :1 nuke-me

commit refs/heads/orphan-me
mark :6
original-oid 000000000000000000000000000000000000000A
author Little 'ol Me <me@laptop.(none)> 1535229544 -0700
committer Little 'ol Me <me@laptop.(none)> 1535229544 -0700
data 9
Tweak it
from :5
M 100644 :4 nuke-me

reset refs/heads/master
commit refs/heads/master
mark :7
original-oid 000000000000000000000000000000000000000B
author Little O. Me <[email protected]> 1535229523 -0700
committer Little O. Me <[email protected]> 1535229523 -0700
data 15
Initial commit
M 100644 :1 fileA

commit refs/heads/master
mark :8
original-oid 000000000000000000000000000000000000000C
author Lit.e Me <[email protected]> 1535229559 -0700
committer Lit.e Me <[email protected]> 1535229580 -0700
data 10
Add fileB
from :7
M 100644 :2 fileB

commit refs/heads/master
mark :9
original-oid 000000000000000000000000000000000000000D
author Little Me <[email protected]> 1535229601 -0700
committer Little Me <[email protected]> 1535229601 -0700
data 10
Add fileC
from :8
M 100644 :3 fileC

commit refs/heads/master
mark :10
original-oid 000000000000000000000000000000000000000E
author Little Me <[email protected]> 1535229618 -0700
committer Little Me <[email protected]> 1535229618 -0700
data 10
Add fileD
from :9
M 100644 :4 fileD

tag v1.0
from :10
original-oid 000000000000000000000000000000000000000F
tagger Little John <[email protected]> 1535229637 -0700
data 5
v1.0

done
