Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement an "unmunger" (second attempt) #84

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

Ronan-H
Copy link

@Ronan-H Ronan-H commented Mar 9, 2023

For issue #45, and a complete rewrite of my first attempt a few years ago: #55

Instead of splitting by regex, I'm trying out using a trie here. Going character by character, that lets you identify possible substitutions cleanly in linear time. I think it's also more complete in terms of edge cases (ie. where a substitution key is a substring of another sub key)

Example part of the overall trie:
image

Performance

In order to exhaust all possible substitutions though you do still have to build all combinations out recursively, so I think you can't escape the exponential runtime there. That's why the hard limit on total number of combinations is important. I've noticed though for very long password lengths, if you just copy paste some of the string substitution keys over and over, there's a slowdown. It can take more than a full second to estimate long passwords on my machine. But I suspect the source of the problem might be that this alg is running for every possible substring of the password on top. I'm not fully familiar with the overall structure and intentions of the project but maybe the generation of substrings and this new alg should be merged into one overall alg rather than multiplying eachother, if that makes sense.

@MrWook
Copy link

MrWook commented Apr 3, 2023

@Tostino you mentioned this implementation in the typescript repo.
I converted it to typescript, and it was really slow on my side at the beginning.

With the string "P4$$w0rd" it takes over 1000ms before it was a few ms for the same string.
The difference is that the previous implementation only took the complete translation and not the individual options as in the new implementation.
This way we got 16 translation for the new implementation and only 1 for the old one. This is mainly because the levenshtein matcher is kind of heavy which makes this really slow

I got around this issue a little bit by reversing the finalPasswords array and stop the loop if we got a full password match.
In this case the password is P4$$w0rd and the last entry of the finalPasswords array is password which is direct match in the password dictionary. But this only works for single word passwords.

This whole performance drop doesn't matter for random strings like 4@8({[</369&#!1/|0$5+7%2/4@8({[</369&#!1/|0$5+7%2/" or y9CxMeBod*qx@xoPFaLYE9yk_sJGMY-RdxbiGVjEK9g9!o.97!qENngzN8gbMYgHZD9dEu@[email protected]@jd.!D*hTm7JnVFFxLX but it is getting really bad with real multi word passwords like |-|or5e5+ableB4++ery.
This one is getting 320 translation which will be used for the normal dictionary matcher. I already use the levenshtein distance on a full password only but as the translations are full passwords it will be triggered 320 times.
So the last thing that I did, was using the levenshtein distance only on the first entry of the reverse array from finalPasswords which is a full substituted string. So the levenshtein distance will be only triggered on horsestablebattery.

Those are a few learnings with the new implementation. Maybe this is just the JS implementation, I don't know anything about the java port.
But maybe this will help you a little bit

TLTR:

  • reverse finalPasswords list => i think it is more likely to find a full substituted string instead of of a string where only the first char was changed.
  • skip others entries if we already found a full password
    For example if we revert the list for P4$$w0rd, the first entry is password which can be found in the password dictionary. If we already have this we don't need to search for others.
  • use levenshtein distance only on the full substituted entry => with P4$$w0rd only use password

@Ronan-H thanks for this awesome implementation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants