Add better URL detection in ClickableHook #136

Open · wants to merge 2 commits into master
Conversation

@ghost commented Dec 19, 2017

This will detect full URLs; currently, it seems only the part up to the first slash gets matched by the current regex.
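
For illustration, here is roughly what the reported behavior looks like. The host-only pattern below is a simplified stand-in, not Decoda's actual regex:

```php
<?php
// Hypothetical illustration only -- not Decoda's actual pattern.
$text = 'See http://example.com/docs/page?id=1 for details.';

// A host-only pattern stops matching at the first slash:
preg_match('~https?://[\w.-]+~', $text, $m);
echo $m[0], "\n"; // http://example.com

// A pattern that also consumes the path, query, and fragment
// captures the full URL:
preg_match('~https?://[\w.-]+(?:/\S*)?~', $text, $m);
echo $m[0], "\n"; // http://example.com/docs/page?id=1
```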

@milesj (Owner) commented Dec 20, 2017

This would theoretically allow invalid URLs.

@ghost (Author) commented Dec 20, 2017

Your current approach isn't better. It's not like there would be a buffer overflow if there's an invalid URL. OK, I'll try to make something better.

@ghost (Author) commented Dec 20, 2017

Here, I've fixed it.

@milesj (Owner) commented Dec 21, 2017

Do you have example URLs that currently don't pass the validation?

@ghost (Author) commented Dec 21, 2017

Why don't you want to merge it? FILTER_VALIDATE_URL checks URLs strictly against RFC 2396; I doubt you can achieve that level of validation with a regex. It's probably also much faster, and it's much easier to maintain and read than the regex you use. This will also fix #129 and #130 (with bracket support).
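
For reference, a minimal sketch of the `filter_var()` approach being discussed (the example inputs are illustrative):

```php
<?php
// filter_var() with FILTER_VALIDATE_URL returns the URL string
// on success, or false on failure.
var_dump(filter_var('http://example.com/path?q=1#frag', FILTER_VALIDATE_URL));
// => the full URL string

var_dump(filter_var('http://', FILTER_VALIDATE_URL));
// => bool(false), since the host is missing
```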

@HDVinnie (Contributor):

I have to agree with @Hyleus here, @milesj. This is a much, much better approach.

@milesj (Owner) commented Dec 21, 2017

My only issue with the implementation is the massive O(N) loop at the beginning, which traverses the entire content character by character. This hurts performance on large posts and/or multiple Decoda instances on the page.

I think a better approach would be to search for $split_chars first, and then iterate based on the found indices. Something like Decoda's lexer: https://github.com/milesj/decoda/blob/master/src/Decoda/Decoda.php#L1408
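
A hedged sketch of that suggestion: jump between occurrences of each split character instead of visiting every character in user-land. The `$splitChars` list here is illustrative; the PR's actual list may differ:

```php
<?php
// Collect the offsets of every split character with native scans,
// instead of one user-land comparison per character of content.
function findSplitOffsets(string $content, array $splitChars): array
{
    $offsets = [];

    foreach ($splitChars as $char) {
        $offset = 0;

        // One native scan per split character.
        while (($pos = mb_strpos($content, $char, $offset)) !== false) {
            $offsets[] = $pos;
            $offset = $pos + 1;
        }
    }

    sort($offsets);

    return $offsets; // candidate URL tokens live between these offsets
}
```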

@ghost (Author) commented Dec 21, 2017

But isn't mb_strpos O(N) as well for 1-char strings?

@milesj (Owner) commented Dec 21, 2017

Yes, but native code > user-land code in most situations, IMO.

@ghost (Author) commented Dec 22, 2017

Running `time phpunit`:

improve-url-matcher:

```
real	0m0.191s
user	0m0.159s
sys	0m0.031s
```

master:

```
real	0m0.192s
user	0m0.151s
sys	0m0.039s
```

So, as you can see, there's no big difference here; run it multiple times and it'll probably average out the same.

@milesj (Owner) commented Dec 23, 2017

You're testing a best/perfect case, which is irrelevant to Big-O. Let me put it this way: your current loop would do a strpos() lookup for every single character in the content, so if there are 10,000 characters, that's 10,000 strpos() lookups. If you do the inverse, it's many times fewer lookups. The big hurdle is space detection, which you can probably solve with split + splice.
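
One reading of the "split + splice" idea, as a sketch rather than the PR's implementation (HTML escaping is omitted for brevity):

```php
<?php
// Split on whitespace while keeping the delimiters, linkify the
// tokens that validate as URLs, then join everything back together.
$content = 'Visit http://example.com/a?b=c today.';

$parts = preg_split('/(\s+)/', $content, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    if (filter_var($part, FILTER_VALIDATE_URL) !== false) {
        $parts[$i] = sprintf('<a href="%s">%s</a>', $part, $part);
    }
}

echo implode('', $parts);
// Visit <a href="http://example.com/a?b=c">http://example.com/a?b=c</a> today.
```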

@ghost (Author) commented Dec 23, 2017

Yeah, but we need to search for multiple split chars, so determining the split_chars first would end up with the same performance, because we would have to search the string for every split char no matter what. No, the number of lookups doesn't change: we would still scan five times for the split_chars, because there are five split chars.

@milesj (Owner) commented Dec 23, 2017

True, what about preg_match + PREG_OFFSET_CAPTURE?
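
A sketch of that idea (combined here with `filter_var()` for validation; the tokenizing pattern is an assumption, not the PR's code):

```php
<?php
// One native regex pass returns every whitespace-delimited token
// together with its byte offset in the content.
$content = 'Text with http://example.com/a?b=c and more text.';

preg_match_all('~\S+~', $content, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as [$token, $offset]) {
    if (filter_var($token, FILTER_VALIDATE_URL) !== false) {
        echo "$offset: $token\n"; // 10: http://example.com/a?b=c
    }
}
```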

@ghost (Author) commented Dec 27, 2017

I think the regex would still work the same way as my loop internally. Or do you think there would be a difference?

@milesj (Owner) commented Jan 2, 2018

My only issue is that we're now looping over the entire content twice: once for the initial lexer, and again for this hook. That's quite a change.

I agree that this is a better way of finding URLs; I'm just worried about the perf impact. Ideally this would hook into the lexer somehow, but that's probably more work than necessary.

@dereuromark (Collaborator):

Could this be a feature flag/option to turn on if you want it? Then people could decide on their own whether it is worth sacrificing a bit of runtime in favor of more correct parsing.
Just an idea.

@alquerci (Contributor) commented Sep 21, 2020

@dereuromark

> Could this be a feature flag/option to turn on if you want it?

Yes, that could be a solution.


Firstly, there are no test cases on this PR. The first step is to add a failing test, then make the test pass, then measure the performance impact #136 (comment).

Secondly, there may be a better and more elegant way to achieve the same thing, but the most important thing is missing: a way to reproduce the issue. So it needs at least one day of focus.

Both issues mentioned here are already fixed:

> This will also fix #129 and #130 (with bracket support).
