-
Notifications
You must be signed in to change notification settings - Fork 321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[py-tx] Add a new match command line option (--rotations) #1672
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For your test plan, showing the ouput of the match function will be critical. Here's what I want to see:
Using https://github.com/facebook/ThreatExchange/blob/main/pdq/data/bridge-mods/aaa-orig.jpg
# Showing hash of original
$ tx hash photo aaa-orig.jpg
pdq facefacefafce...
# Here we use imageMagick, but any editor will work
$ magick aaa-org.jpg -rotate 90 aaa-90.jpg
# index sample data, which includes aaa-orig
$ tx fetch
# Now match against rotated
$ tx match photo aaa-90.jpg
pdq sample 0[90º] INGVESTIGATION_SEED
You can also consider adding --rotations to tx hash
as well, which will be simpler.
# Add rotation information if possible | ||
rotation_info = "" | ||
if hasattr(r.similarity_info, "rotation"): | ||
rotation_info = f" [{r.similarity_info.rotation.name}]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While clever, and valid python, this will make typechecking unhappy (try running mypy locally as described in https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/CONTRIBUTING.md).
Instead, change the return type of these functions from IndexMatch to a new object that contains an index match and the rotation info.
|
||
# Add rotation information if any matches were found | ||
for match in matches: | ||
match.similarity_info.rotation = rotation_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: I think this might be an illegal operation, and I'm unsure whether we should add rotation to the base concept of similarity.
We have a few options, but the simplest one might be change the return of this function from IndexMatch to instead return a new dataclass that might contains the rotation information.
I'll accept any solution that:
- Doesn't change the base interface (since I don't think we yet know how to handle rotations in general - you are pioneering our approach!)
- Can show which rotation was used in
tx match
.
…xchange into rotation-cmd Merge the new commits from photo-helpers branch to this branch.
results: t.Sequence[IndexMatchWithRotation] = [ | ||
IndexMatchWithRotation(match=match) | ||
for match in results_from_hashes | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a new dataclass IndexMatchWithRotation which will be the return type of _match_file.
Q: @Dcallies Should I also make it the return type of _match_hashes? Just so the return types of these 2 are similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the python typing pass you may need to be consistent between them (see what mypy says)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice progress! Starting to get close.
- Let's not print the rotation unless --rotations is provided - this is why your tests are failing that check the output
- Some notes inline about how to handle that.
- If you haven't done it yet, it will be easier to add --rotations to
tx hash
than it will be to match.
results: t.Sequence[IndexMatchWithRotation] = [ | ||
IndexMatchWithRotation(match=match) | ||
for match in results_from_hashes | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the python typing pass you may need to be consistent between them (see what mypy says)
@@ -82,7 +83,7 @@ def flip_minus1(cls, image_data: bytes) -> bytes: | |||
return buffer.getvalue() | |||
|
|||
@classmethod | |||
def all_simple_rotations(cls, image_data: bytes): | |||
def all_simple_rotations(cls, image_data: bytes) -> t.Dict[RotationType, bytes]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
@dataclass | ||
class IndexMatchWithRotation(t.Generic[T]): | ||
match: IndexMatchUntyped[SignalSimilarityInfo, T] | ||
rotation_type: RotationType = RotationType.ORIGINAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When you move it into the match command, consider making this optional, then adding a __str__
command which handles the output of the match command. If rotation_type is None, then no rotation is printed. If the rotation is there, then it can format the string you put to string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting closer, but maybe let's break down the problem a little more.
It may help to break this down into two PRs - one for the hash
command, and one for the match command.
Hash will be more straightforward!
In order to do that, you'll need to use some git tricks to "split" the stack, which can be daunting.
One way we could do that is the following (ref)
- Checkout the
main
branch withgit checkout main
- Create a new branch for only the hash command
git checkout -b hash_cmd_rotations
- Pull only the contents of the hash command to your new branch
git checkout photo-helpers -- python-threatexchange/threatexchange/cli/hash_cmd.py
- Create a new commit on our new branch `git commit -am '[pytx] add --rotations to hash_cmd'
- Push this branch to start a new PR
from threatexchange.content_type.video import VideoContent | ||
|
||
|
||
class TestHashCommand(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: We are using the py.test framework for this library, not unittest.
Check out: https://docs.pytest.org/en/stable/
and this reference test: https://github.com/haianhng31/ThreatExchange/blob/8d10599cda78afa49f763c08ca556db94f4ae332/python-threatexchange/threatexchange/cli/tests/match_cmd_test.py
We should probably just use the hash and match_cmd tests rather than creating new tests.
@@ -7,7 +7,7 @@ | |||
This records all the valid signal types for a piece of content. | |||
""" | |||
|
|||
from enum import Enum, auto | |||
from enum import Enum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job splitting your PR! This is quite difficult to do any many people have a lot more trouble their first time!
|
||
for rotation_type, rotated_bytes in rotated_images.items(): | ||
# Create a temporary file to hold the image bytes | ||
with tempfile.NamedTemporaryFile() as temp_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: This is going to write a lot of files over the course of an execution! We are going to call this method for every single match type.
Another approach you could do is refactor this so that the rotated images are inserted higher up in the stack, and then rather than taking rotations: bool, you could pass in the path of the rotation, and an optional enum representing which enum it can take.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of change s_type to be subclass of BytesHasher so that I can use bytes directly with hash_from_bytes
. All the tests passed and mypy doesn't complain. What do you think of this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the last PR, I mentioned the downside - not all photo formats are knowable without the extension. There's likely a workaround, but let's stay with the current course and see if we can fix it in a followup instead.
@@ -196,18 +230,23 @@ def execute(self, settings: CLISettings) -> None: | |||
for s_type, index in indices: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before processing this list of files, if rotations is true here, generate new files that you iterate through.
Here's one way to do that
def handle_rotations() -> Iterator[(Path, t.Optional[RotationType)]:
for file in self.files:
if not self.rotations:
yield file, None
continue
for rot, dat in PhotoContent.blah():
with NamedTemporary(...) as f:
yield Path(f.name), rot
for path in handle_rotations(self.files):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All that's missing now:
- We should update the unittest for match to also include a case for --rotations
- From your test plan it looks like it's not matching hashes that it should, since the rotated image should match the rotated hash.
Here's a e2e test case:
$ threatexchange hash --rotations photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true | grep FLIPX | cut -d ' ' -f 2- >> ~/file.txt
$ threatexchange config collab edit local_file --filename ~/file.txt 'file.txt' --create
$ threatexchange fetch
$ threatexchange match photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
# No match
$ threatexchange match --rotations photo https://github.com/facebook/ThreatExchange/blob/main/pdq/data/misc-images/b.jpg?raw=true
pdq - (file.txt) INVESTIGATION_SEED
074e3bf
to
e367f61
Compare
def test_photo_hash_with_rotations(self): | ||
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg") | ||
|
||
hash_cmd = ThreatExchangeCLIE2eHelper() | ||
hash_cmd.COMMON_CALL_ARGS = ("hash",) | ||
hash_cmd._state_dir = pathlib.Path() | ||
|
||
hash = hash_cmd.cli_call("photo", str(test_file)) | ||
assert hash == "pdq fb4eed46cb8a6c78819ca06b756c541f7b07ef6d02c82fccd00f862166272cda\n" | ||
|
||
# rotated_images = PhotoContent.all_simple_rotations(test_file.read_bytes()) | ||
|
||
# img = rotated_images[RotationType.ROTATE90] #try with 1 rotated image first | ||
|
||
# with tempfile.NamedTemporaryFile() as tmp_file: | ||
# img = rotated_images[RotationType.ROTATE90] | ||
# tmp_file.write(img) | ||
# self.assert_cli_output( | ||
# ("--rotations", "photo", "--", tmp_file.name), | ||
# "video_md5 - (Sample Signals) INVESTIGATION_SEED", | ||
# ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My current approach for this test:
- Hash an image
- Fetch
- Call
match --rotations
the rotated version of the original image
Q: @Dcallies
- I got the hash pdq of the image (line 53) but how do i fetch indexes and call match command on this hash?
- If there is a match found, what should the output look like? When I try it with the CLI it keeps output
pdq 16 (Sample Signals) INVESTIGATION_SEED
or if i'm using the config collab:
pdq 0 (file.txt) INVESTIGATION_SEED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a match, then the output is the matching collaboration and distance.
Consider this ouptput:
pdq 16 (Sample Signals) INVESTIGATION_SEED
signal_type | distance | collab name | confidence |
---|---|---|---|
pdq | 16 | Sample Signals | INVESTIGATION_SEED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I did:
First I sanity checked that we expect the produced hashes to match
$ tx hash --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg
ORIGINAL pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
ROTATE90 pdq 30a10efd71cc3d429013d48d0ffffc52e34e0e17ada952a9d29685211ea9e5af
ROTATE180 pdq adad5a64b5a142e75b62a09857da895ae63b847fc23794b766b319361bc93188
ROTATE270 pdq a5f0a457a48995e8c9065c275aaa5498b61ba4bdf8fcf80387c32f8b1bfc4f05
FLIPX pdq f8f80f31e0f417b20e37f5cd028f980fb36ed02a9662c1e233e64c634e9c64dd
FLIPY pdq 0dad2599b1a1bd1a5362576742da32a5e63b7380c2374b4866b366c91bc9ce77
FLIPPLUS1 pdq f0a5e102f1ccc0bd945308720fff038de34ef1e8ada9a956d2967ade5ea91a50
FLIPMINUS1 pdq a5f05aa8a4896a17c906a2d85aaaab07b61b5b42f8fc07fc87c3d0741bfcb0fa
# Check original
$ tx match photo -H f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
pdq 16 (Sample Signals) INVESTIGATION_SEED
# Note - this should be distance 0, but it's stopping at the first match, which is one of the variations
# Check with rotation
$ tx match --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg
pdq 16 (Sample Signals) INVESTIGATION_SEED
However, I can see an error here - why isn't the rotation printed? When I looked to answer this question I found a bug in your code, which I found by adding some helpful print statements. I believe you can find that bug too!
@Dcallies |
Hey @haianhng31, what do you mean? The hash command and the match command are different commands, and so one should not break the other. My guess is that you mean that the output of the rotation type makes it not read correctly from a file. Take a closer look at the command I gave you before:
Let me try patching your code on my machine locally to see if I can see what you mean... |
def test_photo_hash_with_rotations(self): | ||
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg") | ||
|
||
hash_cmd = ThreatExchangeCLIE2eHelper() | ||
hash_cmd.COMMON_CALL_ARGS = ("hash",) | ||
hash_cmd._state_dir = pathlib.Path() | ||
|
||
hash = hash_cmd.cli_call("photo", str(test_file)) | ||
assert hash == "pdq fb4eed46cb8a6c78819ca06b756c541f7b07ef6d02c82fccd00f862166272cda\n" | ||
|
||
# rotated_images = PhotoContent.all_simple_rotations(test_file.read_bytes()) | ||
|
||
# img = rotated_images[RotationType.ROTATE90] #try with 1 rotated image first | ||
|
||
# with tempfile.NamedTemporaryFile() as tmp_file: | ||
# img = rotated_images[RotationType.ROTATE90] | ||
# tmp_file.write(img) | ||
# self.assert_cli_output( | ||
# ("--rotations", "photo", "--", tmp_file.name), | ||
# "video_md5 - (Sample Signals) INVESTIGATION_SEED", | ||
# ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I did:
First I sanity checked that we expect the produced hashes to match
$ tx hash --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg
ORIGINAL pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
ROTATE90 pdq 30a10efd71cc3d429013d48d0ffffc52e34e0e17ada952a9d29685211ea9e5af
ROTATE180 pdq adad5a64b5a142e75b62a09857da895ae63b847fc23794b766b319361bc93188
ROTATE270 pdq a5f0a457a48995e8c9065c275aaa5498b61ba4bdf8fcf80387c32f8b1bfc4f05
FLIPX pdq f8f80f31e0f417b20e37f5cd028f980fb36ed02a9662c1e233e64c634e9c64dd
FLIPY pdq 0dad2599b1a1bd1a5362576742da32a5e63b7380c2374b4866b366c91bc9ce77
FLIPPLUS1 pdq f0a5e102f1ccc0bd945308720fff038de34ef1e8ada9a956d2967ade5ea91a50
FLIPMINUS1 pdq a5f05aa8a4896a17c906a2d85aaaab07b61b5b42f8fc07fc87c3d0741bfcb0fa
# Check original
$ tx match photo -H f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
pdq 16 (Sample Signals) INVESTIGATION_SEED
# Note - this should be distance 0, but it's stopping at the first match, which is one of the variations
# Check with rotation
$ tx match --rotations photo /workspaces/devcontainer-ThreatExchange/pdq/data/bridge-mods/aaa-orig.jpg
pdq 16 (Sample Signals) INVESTIGATION_SEED
However, I can see an error here - why isn't the rotation printed? When I looked to answer this question I found a bug in your code, which I found by adding some helpful print statements. I believe you can find that bug too!
) | ||
|
||
def test_photo_hash_with_rotations(self): | ||
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Danger! This photo might not correspond to one of the sample signals. Check out the comments on https://github.com/facebook/ThreatExchange/blob/main/python-threatexchange/threatexchange/signal_type/pdq/signal.py#L90
Additionally, this makes the test not work if run from different directories! Here's how I fixed it:
test_file = pathlib.Path("threatexchange/tests/hashing/resources/rgb.jpeg") | |
test_file = pathlib.Path( | |
__file__ + "../../../../../../pdq/data/bridge-mods/aaa-orig.jpg" | |
).resolve() |
Summary
Files changed:
Question:
Test Plan