Add better URL detection in ClickableHook #136

Open · wants to merge 2 commits into master
Conversation

@ghost commented Dec 19, 2017

This will detect full URLs; currently, it seems only the part up to the first slash gets matched by the current regex.
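
For illustration, here is roughly what the reported behavior looks like. The host-only pattern below is a simplified stand-in, not Decoda's actual regex:

```php
<?php
// Hypothetical illustration only -- not Decoda's actual pattern.
$text = 'See http://example.com/docs/page?id=1 for details.';

// A host-only pattern stops matching at the first slash:
preg_match('~https?://[\w.-]+~', $text, $m);
echo $m[0], "\n"; // http://example.com

// A pattern that also consumes the path, query, and fragment
// captures the full URL:
preg_match('~https?://[\w.-]+(?:/\S*)?~', $text, $m);
echo $m[0], "\n"; // http://example.com/docs/page?id=1
```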

@milesj (Owner) commented Dec 20, 2017

This would theoretically allow invalid URLs.

@ghost (Author) commented Dec 20, 2017

Your current approach isn't better. It's not like there would be a buffer overflow if there's an invalid URL. OK, I'll try to make something better.

@ghost (Author) commented Dec 20, 2017

Here, I've fixed it.

@milesj (Owner) commented Dec 21, 2017

Do you have example URLs that currently don't pass the validation?

@ghost (Author) commented Dec 21, 2017

Why don't you want to merge it? FILTER_VALIDATE_URL checks URLs strictly against RFC 2396; I doubt you can achieve that level of validation with a regex. It's probably also much faster, and it's much easier to maintain and read than the regex you use. This will also fix #129 and #130 (with bracket support).
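
For reference, a minimal sketch of the `filter_var()` approach being discussed (the example inputs are illustrative):

```php
<?php
// filter_var() with FILTER_VALIDATE_URL returns the URL string
// on success, or false on failure.
var_dump(filter_var('http://example.com/path?q=1#frag', FILTER_VALIDATE_URL));
// => the full URL string

var_dump(filter_var('http://', FILTER_VALIDATE_URL));
// => bool(false), since the host is missing
```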

@HDVinnie (Contributor):

I have to agree with @Hyleus here, @milesj. This is a much, much better approach.

@milesj (Owner) commented Dec 21, 2017

My only issue with the implementation is the massive O(N) loop at the beginning, which traverses the entire content character by character. This hurts performance on large posts and/or multiple Decoda instances on the page.

I think a better approach would be to search for $split_chars first, and then iterate based on the found indices. Something like Decoda's lexer: https://github.com/milesj/decoda/blob/master/src/Decoda/Decoda.php#L1408
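
A hedged sketch of that suggestion: jump between occurrences of each split character instead of visiting every character in user-land. The `$splitChars` list here is illustrative; the PR's actual list may differ:

```php
<?php
// Collect the offsets of every split character with native scans,
// instead of one user-land comparison per character of content.
function findSplitOffsets(string $content, array $splitChars): array
{
    $offsets = [];

    foreach ($splitChars as $char) {
        $offset = 0;

        // One native scan per split character.
        while (($pos = mb_strpos($content, $char, $offset)) !== false) {
            $offsets[] = $pos;
            $offset = $pos + 1;
        }
    }

    sort($offsets);

    return $offsets; // candidate URL tokens live between these offsets
}
```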

@ghost (Author) commented Dec 21, 2017

But isn't mb_strpos O(N) as well for 1-char strings?

@milesj (Owner) commented Dec 21, 2017

Yes, but native code > user-land code in most situations, IMO.

@ghost (Author) commented Dec 22, 2017

Running `time phpunit`:

improve-url-matcher:

```
real	0m0.191s
user	0m0.159s
sys	0m0.031s
```

master:

```
real	0m0.192s
user	0m0.151s
sys	0m0.039s
```

So, as you can see, there's no big difference here; run it multiple times and it'll probably average out the same.

@milesj (Owner) commented Dec 23, 2017

You're testing a best/perfect case, which is irrelevant to Big-O. Let me put it this way: your current loop would do a strpos() lookup for every single character in the content, so if there are 10,000 characters, that's 10,000 strpos() lookups. If you do the inverse, it's many times fewer lookups. The big hurdle is space detection, which you can probably solve with split + splice.
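
One reading of the "split + splice" idea, as a sketch rather than the PR's implementation (HTML escaping is omitted for brevity):

```php
<?php
// Split on whitespace while keeping the delimiters, linkify the
// tokens that validate as URLs, then join everything back together.
$content = 'Visit http://example.com/a?b=c today.';

$parts = preg_split('/(\s+)/', $content, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    if (filter_var($part, FILTER_VALIDATE_URL) !== false) {
        $parts[$i] = sprintf('<a href="%s">%s</a>', $part, $part);
    }
}

echo implode('', $parts);
// Visit <a href="http://example.com/a?b=c">http://example.com/a?b=c</a> today.
```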

@ghost (Author) commented Dec 23, 2017

Yeah, but we need to search for multiple split chars, so determining the split_chars first would end up with the same performance, because we would have to search the string for every split char no matter what. No, the number of lookups doesn't change: we would still scan five times for the split_chars, because there are five split chars.

@milesj (Owner) commented Dec 23, 2017

True, what about preg_match + PREG_OFFSET_CAPTURE?
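
A sketch of that idea (combined here with `filter_var()` for validation; the tokenizing pattern is an assumption, not the PR's code):

```php
<?php
// One native regex pass returns every whitespace-delimited token
// together with its byte offset in the content.
$content = 'Text with http://example.com/a?b=c and more text.';

preg_match_all('~\S+~', $content, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$token, $offset]) {
    if (filter_var($token, FILTER_VALIDATE_URL) !== false) {
        echo "$offset: $token\n"; // 10: http://example.com/a?b=c
    }
}
```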

@ghost (Author) commented Dec 27, 2017

I think the regex would still work the same way as my loop internally. Or do you think there would be a difference?

@milesj (Owner) commented Jan 2, 2018

My only issue is that we're now looping over the entire content twice: once for the initial lexer, and again for this hook. That's quite a change.

I agree that this is a better way of finding URLs; I'm just worried about the perf impact. Ideally this would hook into the lexer somehow, but that's probably more work than necessary.

@dereuromark (Collaborator):

Could this be a feature flag/option to turn on if you want it? Then people could decide on their own whether it is worth sacrificing a bit of runtime in favor of more correct parsing.
Just an idea.

@alquerci (Contributor) commented Sep 21, 2020

@dereuromark

> Could this be a feature flag/option to turn on if you want it?

Yes, that could be a solution.


Firstly, there are no test cases on this PR. The first step is to add a failing test, then make the test pass, then measure the performance impact #136 (comment).

Secondly, there may be a better and more elegant way to achieve the same thing, but the most important thing is missing: a way to reproduce the issue. So it needs at least one day of focus.

Both issues mentioned here are already fixed:

> This will also fix #129 and #130 (with bracket support).
