[pytx] No match results if creating a local_file with only 1 hash in it #1318

Open
Dcallies opened this issue Jun 14, 2023 · 2 comments
Labels
bug do-not-reap pdq Items related to the pdq libraries or reference implementations python-threatexchange Items related to the threatexchange python tool / library successful reproduction This bug has a consistent reproduction

Comments

@Dcallies
Contributor

Repro:

$ tx hash photo pdq/data/bridge-mods/aaa-orig.jpg >> local_file.txt
$ tx config collab edit local_file file_backed_bank.txt --filename ~/file_backed_bank.txt  --create
$ tx fetch
$ tx match photo pdq/data/bridge-mods/aaa-orig.jpg 

Expected: at least one match, since the bank contains the hash of the same photo.
Actual: no matches. Oddly enough, adding a second hash allows all hashes to match:

$ tx hash photo pdq/data/bridge-mods/blur-a-little.jpg >> local_file.txt
$ tx fetch
$ tx match -A photo pdq/data/bridge-mods/aaa-orig.jpg 
pdq 4 (file_backed_bank.txt) INVESTIGATION_SEED
pdq 0 (file_backed_bank.txt) INVESTIGATION_SEED
@Dcallies Dcallies added bug do-not-reap python-threatexchange Items related to the threatexchange python tool / library successful reproduction This bug has a consistent reproduction labels Jun 14, 2023
@facebook facebook deleted a comment from Yisu123 Jun 14, 2023
@Dcallies Dcallies added the pdq Items related to the pdq libraries or reference implementations label Mar 14, 2024
@Dcallies Dcallies self-assigned this Mar 14, 2024
@jagraff
Contributor

jagraff commented Mar 14, 2024

This bug shows up in HMA as well; adding a repro in case it's helpful.
For this repro to work, HMA (previously OpenMediaMatch) should be running as a Docker container and serving on localhost:8080.

Reset the tables:

$ docker-compose exec app flask --app OpenMediaMatch.app reset_all_tables
[2024-03-13 18:10:12,303] WARNING in app: No storage class provided, using the default

Create a bank:

$ curl --location 'localhost:8080/c/banks' \
--header 'Content-Type: application/json' \
--data '{
    "name": "EVIL_CONTENT_BANK"
}'
{"matching_enabled_ratio":1.0,"name":"EVIL_CONTENT_BANK"}

Add a file to the bank:

$ curl --location 'localhost:8080/c/bank/EVIL_CONTENT_BANK/content' \
--form 'photo=@"<photo path>"'
{"id":1,"signals":{"pdq":"3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb"}}

Rebuild indexes:

$ docker-compose exec app flask --app OpenMediaMatch.app build_indices
[2024-03-13 18:12:21,582] WARNING in app: No storage class provided, using the default
[2024-03-13 18:12:21,596] INFO in build_index: Running the build_all_indices background task
[2024-03-13 18:12:21,628] INFO in build_index: Building index for pdq (1 signals)
[2024-03-13 18:12:21,630] INFO in build_index: Indexed 1 signals for pdq - 0 seconds
[2024-03-13 18:12:21,631] DEBUG in database: Index[pdq] serializing index to tmpfile /tmp/tmp9_c80zmc
[2024-03-13 18:12:21,631] DEBUG in database: Index[pdq] finished writing to tmpfile, 1 signals 889 bytes - 0 seconds
[2024-03-13 18:12:21,635] DEBUG in database: Index[pdq] imported tmpfile as lobject oid 16750 - 0 seconds
[2024-03-13 18:12:21,635] DEBUG in database: Index[pdq] deallocating old lobject 16747
[2024-03-13 18:12:21,636] DEBUG in database: Index[pdq] cleaned up tmpfile
[2024-03-13 18:12:21,639] INFO in build_index: video_md5 index up to date, no build needed
[2024-03-13 18:12:21,639] INFO in build_index: Completed build_all_indices background task - 0 seconds

Query the bank:

$ curl --location 'localhost:8080/m/lookup?signal_type=pdq&signal=3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb'
[]

As you can see, the lookup incorrectly returns no matches even though there should be a match. Adding a second photo, reindexing, and then querying again returns a match:

$ curl --location 'localhost:8080/c/bank/EVIL_CONTENT_BANK/content' \
--form 'photo=@"<second photo path>"'
{"id":2,"signals":{"pdq":"cddcc471737d333771469b9e4c119ce6526e52753f86d1239290469b499941be"}}

$ docker-compose exec app flask --app OpenMediaMatch.app build_indices
[2024-03-13 18:14:20,080] WARNING in app: No storage class provided, using the default
[2024-03-13 18:14:20,093] INFO in build_index: Running the build_all_indices background task
[2024-03-13 18:14:20,124] INFO in build_index: Building index for pdq (3 signals)
[2024-03-13 18:14:20,126] INFO in build_index: Indexed 3 signals for pdq - 0 seconds
[2024-03-13 18:14:20,127] DEBUG in database: Index[pdq] serializing index to tmpfile /tmp/tmp_42xfk4q
[2024-03-13 18:14:20,127] DEBUG in database: Index[pdq] finished writing to tmpfile, 3 signals 1176 bytes - 0 seconds
[2024-03-13 18:14:20,132] DEBUG in database: Index[pdq] imported tmpfile as lobject oid 16751 - 0 seconds
[2024-03-13 18:14:20,132] DEBUG in database: Index[pdq] deallocating old lobject 16750
[2024-03-13 18:14:20,134] DEBUG in database: Index[pdq] cleaned up tmpfile
[2024-03-13 18:14:20,137] INFO in build_index: video_md5 index up to date, no build needed
[2024-03-13 18:14:20,137] INFO in build_index: Completed build_all_indices background task - 0 seconds

$ curl --location 'localhost:8080/m/lookup?signal_type=pdq&signal=3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb'
["EVIL_CONTENT_BANK"]
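
For reference, the expected behavior can be checked independently of any index with a brute-force Hamming-distance scan. The sketch below is not the threatexchange or HMA implementation; the function names are hypothetical, and the threshold of 31 is an assumption based on the commonly cited PDQ match threshold. It uses the two hashes from the repro above and shows that a one-entry bank should still match its own hash at distance 0.

```python
# Brute-force reference matcher for PDQ hashes (256 bits, 64 hex characters).
# This is NOT the threatexchange/HMA index; all names here are hypothetical.

PDQ_MATCH_THRESHOLD = 31  # assumption: commonly cited PDQ match threshold


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Count the differing bits between two hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def brute_force_lookup(query, bank, threshold=PDQ_MATCH_THRESHOLD):
    """Return sorted (distance, hash) pairs from `bank` within `threshold`."""
    hits = []
    for candidate in bank:
        distance = hamming_distance(query, candidate)
        if distance <= threshold:
            hits.append((distance, candidate))
    return sorted(hits)


# Hashes copied from the repro above.
FIRST = "3517f92351b0e69170c9656ba70c1249d258926d6d65bd2cbcb49cb34bd1c4fb"
SECOND = "cddcc471737d333771469b9e4c119ce6526e52753f86d1239290469b499941be"

# A bank with a single entry must still match its own hash at distance 0.
single = brute_force_lookup(FIRST, [FIRST])

# Adding a second entry should not change whether FIRST matches.
double = brute_force_lookup(FIRST, [FIRST, SECOND])
```

A scan like this is linear in the bank size, so it is only useful as a correctness oracle to compare against whatever the real index returns for small banks, including the one-entry case.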

@Dcallies
Contributor Author

I suggest we "fix" this with a new, clean reimplementation, described here: #1613
