Commit 0fa5367

Update meaning of {,5} etc to match update in Perl 5.34.0; refactor quantifier parsing
1 parent 09c41a1 commit 0fa5367

17 files changed: +1114 -887 lines changed

ChangeLog

Lines changed: 4 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -109,6 +109,10 @@ the code for extending the heap frames vector. This fixes GitHub issue #275.
109109
28. Update pcre2_fuzzsupport.c to avoid clang sanitize complaint about shifting
110110
left by 16 when there are non-zeros in the top 16 bits.
111111

112+
29. Perl 5.34.0 changed the meaning of (for example) {,3}, which was previously
113+
not treated as a quantifier. Now it is interpreted as {0,3} and PCRE2 has
114+
changed to match. Note that {,} is still not a quantifier.
115+
112116

113117
Version 10.42 11-December-2022
114118
------------------------------

doc/html/pcre2compat.html

Lines changed: 22 additions & 17 deletions
@@ -46,13 +46,18 @@ <h1>pcre2compat man page</h1>
4646
any kind of quantifier on non-lookaround assertions.
4747
</P>
4848
<P>
49-
4. Capture groups that occur inside negative lookaround assertions are counted,
49+
4. If a quantifier appears where there is nothing to repeat (for example, at
50+
the start of a branch), PCRE2 raises an error whereas Perl treats the
51+
quantifier characters as literal.
52+
</P>
53+
<P>
54+
5. Capture groups that occur inside negative lookaround assertions are counted,
5055
but their entries in the offsets vector are set only when a negative assertion
5156
is a condition that has a matching branch (that is, the condition is false).
5257
Perl may set such capture groups in other circumstances.
5358
</P>
5459
<P>
55-
5. The following Perl escape sequences are not supported: \F, \l, \L, \u,
60+
6. The following Perl escape sequences are not supported: \F, \l, \L, \u,
5661
\U, and \N when followed by a character name. \N on its own, matching a
5762
non-newline character, and \N{U+dd..}, matching a Unicode code point, are
5863
supported. The escapes that modify the case of following letters are
@@ -63,7 +68,7 @@ <h1>pcre2compat man page</h1>
6368
interprets them.
6469
</P>
6570
<P>
66-
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
71+
7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 is
6772
built with Unicode support (the default). The properties that can be tested
6873
with \p and \P are limited to the general category properties such as Lu and
6974
Nd, script names such as Greek or Han, Bidi_Class, Bidi_Control, and the
@@ -75,7 +80,7 @@ <h1>pcre2compat man page</h1>
7580
to prefix any of these properties with "Is".
7681
</P>
7782
<P>
78-
7. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
83+
8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
7984
in between are treated as literals. However, this is slightly different from
8085
Perl in that $ and @ are also handled as literals inside the quotes. In Perl,
8186
they cause variable interpolation (PCRE2 does not have variables). Also, Perl
@@ -96,19 +101,19 @@ <h1>pcre2compat man page</h1>
96101
by both PCRE2 and Perl.
97102
</P>
98103
<P>
99-
8. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
104+
9. Fairly obviously, PCRE2 does not support the (?{code}) and (??{code})
100105
constructions. However, PCRE2 does have a "callout" feature, which allows an
101106
external function to be called during pattern matching. See the
102107
<a href="pcre2callout.html"><b>pcre2callout</b></a>
103108
documentation for details.
104109
</P>
105110
<P>
106-
9. Subroutine calls (whether recursive or not) were treated as atomic groups up
111+
10. Subroutine calls (whether recursive or not) were treated as atomic groups up
107112
to PCRE2 release 10.23, but from release 10.30 this changed, and backtracking
108113
into subroutine calls is now supported, as in Perl.
109114
</P>
110115
<P>
111-
10. In PCRE2, if any of the backtracking control verbs are used in a group that
116+
11. In PCRE2, if any of the backtracking control verbs are used in a group that
112117
is called as a subroutine (whether or not recursively), their effect is
113118
confined to that group; it does not extend to the surrounding pattern. This is
114119
not always the case in Perl. In particular, if (*THEN) is present in a group
@@ -117,20 +122,20 @@ <h1>pcre2compat man page</h1>
117122
processed as anchored at the point where they are tested.
118123
</P>
119124
<P>
120-
11. If a pattern contains more than one backtracking control verb, the first
125+
12. If a pattern contains more than one backtracking control verb, the first
121126
one that is backtracked onto acts. For example, in the pattern
122127
A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure in C
123128
triggers (*PRUNE). Perl's behaviour is more complex; in many cases it is the
124129
same as PCRE2, but there are cases where it differs.
125130
</P>
126131
<P>
127-
12. There are some differences that are concerned with the settings of captured
132+
13. There are some differences that are concerned with the settings of captured
128133
strings when part of a pattern is repeated. For example, matching "aba" against
129134
the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE2 it is set to
130135
"b".
131136
</P>
132137
<P>
133-
13. PCRE2's handling of duplicate capture group numbers and names is not as
138+
14. PCRE2's handling of duplicate capture group numbers and names is not as
134139
general as Perl's. This is a consequence of the fact that PCRE2 works internally
135140
just with numbers, using an external table to translate between numbers and
136141
names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B)), where the two
@@ -140,34 +145,34 @@ <h1>pcre2compat man page</h1>
140145
number 1. To avoid this confusing situation, an error is given at compile time.
141146
</P>
142147
<P>
143-
14. Perl used to recognize comments in some places that PCRE2 does not, for
148+
15. Perl used to recognize comments in some places that PCRE2 does not, for
144149
example, between the ( and ? at the start of a group. If the /x modifier is
145150
set, Perl allowed white space between ( and ? though the latest Perls give an
146151
error (for a while it was just deprecated). There may still be some cases where
147152
Perl behaves differently.
148153
</P>
149154
<P>
150-
15. Perl, when in warning mode, gives warnings for character classes such as
155+
16. Perl, when in warning mode, gives warnings for character classes such as
151156
[A-\d] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE2 has no
152157
warning features, so it gives an error in these cases because they are almost
153158
certainly user mistakes.
154159
</P>
155160
<P>
156-
16. In PCRE2, the upper/lower case character properties Lu and Ll are not
161+
17. In PCRE2, the upper/lower case character properties Lu and Ll are not
157162
affected when case-independent matching is specified. For example, \p{Lu}
158163
always matches an upper case letter. I think Perl has changed in this respect;
159164
in the release at the time of writing (5.34), \p{Lu} and \p{Ll} match all
160165
letters, regardless of case, when case independence is specified.
161166
</P>
162167
<P>
163-
17. From release 5.32.0, Perl locks out the use of \K in lookaround
168+
18. From release 5.32.0, Perl locks out the use of \K in lookaround
164169
assertions. From release 10.38 PCRE2 does the same by default. However, there
165170
is an option for re-enabling the previous behaviour. When this option is set,
166171
\K is acted on when it occurs in positive assertions, but is ignored in
167172
negative assertions.
168173
</P>
169174
<P>
170-
18. PCRE2 provides some extensions to the Perl regular expression facilities.
175+
19. PCRE2 provides some extensions to the Perl regular expression facilities.
171176
Perl 5.10 included new features that were not in earlier versions of Perl, some
172177
of which (such as named parentheses) were in PCRE2 for some time before. This
173178
list is with respect to Perl 5.34:
@@ -219,7 +224,7 @@ <h1>pcre2compat man page</h1>
219224
lookarounds are atomic.
220225
</P>
221226
<P>
222-
19. Perl has different limits than PCRE2. See the
227+
20. Perl has different limits than PCRE2. See the
223228
<a href="pcre2limit.html"><b>pcre2limit</b></a>
224229
documentation for details. Perl went with 5.10 from recursion to iteration
225230
keeping the intermediate matches on the heap, which is ~10% slower but does not
@@ -241,7 +246,7 @@ <h1>pcre2compat man page</h1>
241246
REVISION
242247
</b><br>
243248
<P>
244-
Last updated: 11 August 2023
249+
Last updated: 13 September 2023
245250
<br>
246251
Copyright &copy; 1997-2023 University of Cambridge.
247252
<br>

doc/html/pcre2pattern.html

Lines changed: 32 additions & 21 deletions
@@ -323,7 +323,7 @@ <h1>pcre2pattern man page</h1>
323323
* 0 or more quantifier
324324
+ 1 or more quantifier; also "possessive quantifier"
325325
? 0 or 1 quantifier; also quantifier minimizer
326-
{ start min/max quantifier
326+
{ potential start of min/max quantifier
327327
</pre>
328328
Part of a pattern that is in square brackets is called a "character class". In
329329
a character class the only metacharacters are:
@@ -1914,24 +1914,25 @@ <h1>pcre2pattern man page</h1>
19141914
</P>
19151915
<br><a name="SEC17" href="#TOC1">REPETITION</a><br>
19161916
<P>
1917-
Repetition is specified by quantifiers, which can follow any of the following
1917+
Repetition is specified by quantifiers, which may follow any one of these
19181918
items:
19191919
<pre>
19201920
a literal data character
19211921
the dot metacharacter
19221922
the \C escape sequence
19231923
the \R escape sequence
19241924
the \X escape sequence
1925-
an escape such as \d or \pL that matches a single character
1925+
any escape sequence that matches a single character
19261926
a character class
19271927
a backreference
19281928
a parenthesized group (including lookaround assertions)
19291929
a subroutine call (recursive or otherwise)
19301930
</pre>
1931-
The general repetition quantifier specifies a minimum and maximum number of
1932-
permitted matches, by giving the two numbers in curly brackets (braces),
1933-
separated by a comma. The numbers must be less than 65536, and the first must
1934-
be less than or equal to the second. For example,
1931+
If a quantifier does not follow a repeatable item, an error occurs. The
1932+
general repetition quantifier specifies a minimum and maximum number of
1933+
permitted matches by giving two numbers in curly brackets (braces), separated
1934+
by a comma. The numbers must be less than 65536, and the first must be less
1935+
than or equal to the second. For example,
19351936
<pre>
19361937
z{2,4}
19371938
</pre>
@@ -1946,10 +1947,20 @@ <h1>pcre2pattern man page</h1>
19461947
<pre>
19471948
\d{8}
19481949
</pre>
1949-
matches exactly 8 digits. An opening curly bracket that appears in a position
1950-
where a quantifier is not allowed, or one that does not match the syntax of a
1951-
quantifier, is taken as a literal character. For example, {,6} is not a
1952-
quantifier, but a literal string of four characters.
1950+
matches exactly 8 digits. If the first number is omitted, the lower limit is
1951+
taken as zero; in this case the upper limit must be present.
1952+
<pre>
1953+
X{,4} is interpreted as X{0,4}
1954+
</pre>
1955+
This is a change in behaviour that happened in Perl 5.34.0 and PCRE2 10.43. In
1956+
earlier versions such a sequence was not interpreted as a quantifier. If the
1957+
characters that follow an opening brace do not match the syntax of a
1958+
quantifier, the brace is taken as a literal character. In particular, this
1959+
means that {,} is a literal string of three characters.
1960+
</P>
1961+
<P>
1962+
Note that not every opening brace is potentially the start of a quantifier
1963+
because braces are used in other items such as \N{U+345} or \k{name}.
19531964
</P>
19541965
<P>
19551966
In UTF modes, quantifiers apply to characters rather than to individual code
@@ -1990,11 +2001,11 @@ <h1>pcre2pattern man page</h1>
19902001
</P>
19912002
<P>
19922003
By default, quantifiers are "greedy", that is, they match as much as possible
1993-
(up to the maximum number of permitted times), without causing the rest of the
1994-
pattern to fail. The classic example of where this gives problems is in trying
1995-
to match comments in C programs. These appear between /* and */ and within the
1996-
comment, individual * and / characters may appear. An attempt to match C
1997-
comments by applying the pattern
2004+
(up to the maximum number of permitted repetitions), without causing the rest
2005+
of the pattern to fail. The classic example of where this gives problems is in
2006+
trying to match comments in C programs. These appear between /* and */ and
2007+
within the comment, individual * and / characters may appear. An attempt to
2008+
match C comments by applying the pattern
19982009
<pre>
19992010
/\*.*\*/
20002011
</pre>
@@ -2009,10 +2020,10 @@ <h1>pcre2pattern man page</h1>
20092020
<pre>
20102021
/\*.*?\*/
20112022
</pre>
2012-
does the right thing with the C comments. The meaning of the various
2013-
quantifiers is not otherwise changed, just the preferred number of matches.
2014-
Do not confuse this use of question mark with its use as a quantifier in its
2015-
own right. Because it has two uses, it can sometimes appear doubled, as in
2023+
does the right thing with C comments. The meaning of the various quantifiers is
2024+
not otherwise changed, just the preferred number of matches. Do not confuse
2025+
this use of question mark with its use as a quantifier in its own right.
2026+
Because it has two uses, it can sometimes appear doubled, as in
20162027
<pre>
20172028
\d??\d
20182029
</pre>
@@ -3781,7 +3792,7 @@ <h1>pcre2pattern man page</h1>
37813792
</P>
37823793
<br><a name="SEC32" href="#TOC1">REVISION</a><br>
37833794
<P>
3784-
Last updated: 11 August 2023
3795+
Last updated: 13 September 2023
37853796
<br>
37863797
Copyright &copy; 1997-2023 University of Cambridge.
37873798
<br>
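
The greedy versus lazy behaviour described in the REPETITION hunks above can be observed with a short experiment. This sketch uses Python's `re` module rather than PCRE2, on the assumption that the two engines agree for these two simple patterns. The subject string and the helper name `first_match` are made up for the example.

```python
import re

# Illustrative subject string: two C comments with ordinary text between them.
subject = "/* first comment */  not comment  /* second comment */"

greedy = re.compile(r"/\*.*\*/")   # .* takes as much as it can
lazy = re.compile(r"/\*.*?\*/")    # .*? takes as little as it can


def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = pattern.search(text)
    return m.group(0) if m else None


if __name__ == "__main__":
    print(first_match(greedy, subject))  # spans from the first /* to the last */
    print(first_match(lazy, subject))    # stops at the first */
    print(lazy.findall(subject))         # each comment found separately
```

The greedy pattern swallows the text between the two comments because `.*` extends to the last `*/` in the subject. The lazy form stops at the first `*/`, so repeated searching finds each comment on its own.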

doc/html/pcre2syntax.html

Lines changed: 4 additions & 1 deletion
@@ -305,6 +305,9 @@ <h1>pcre2syntax man page</h1>
305305
{n,} n or more, greedy
306306
{n,}+ n or more, possessive
307307
{n,}? n or more, lazy
308+
{,m} zero up to m, greedy
309+
{,m}+ zero up to m, possessive
310+
{,m}? zero up to m, lazy
308311
</PRE>
309312
</P>
310313
<br><a name="SEC12" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
@@ -604,7 +607,7 @@ <h1>pcre2syntax man page</h1>
604607
</P>
605608
<br><a name="SEC31" href="#TOC1">REVISION</a><br>
606609
<P>
607-
Last updated: 11 August 2023
610+
Last updated: 13 September 2023
608611
<br>
609612
Copyright &copy; 1997-2023 University of Cambridge.
610613
<br>
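
The brace rules documented in this commit ({n}, {n,}, {n,m}, the new {,m}, and {,} staying literal) can be summarised in a small standalone sketch. It is an illustration only, not the refactored parser from this commit. The names `parse_brace_quantifier` and `MAX_REPEAT` are hypothetical, and raising `ValueError` for an out-of-range or reversed pair of limits is a choice made for this example.

```python
import re
from typing import Optional, Tuple

# Hypothetical constant: the documentation says both numbers must be < 65536.
MAX_REPEAT = 65535

# An opening brace, optional digits, optional comma, optional digits, closing brace.
_BRACE = re.compile(r"\{([0-9]*)(,?)([0-9]*)\}")


def parse_brace_quantifier(text: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (min, max) if text starts with a brace quantifier, else None.

    max is None when there is no upper limit, as in {n,}.
    A None result means the opening brace would be taken as a literal.
    """
    m = _BRACE.match(text)
    if m is None:
        return None
    lo, comma, hi = m.groups()
    if not comma:
        if not lo:
            return None                  # "{}" is not a quantifier
        minimum = int(lo)
        maximum: Optional[int] = minimum  # {n} means exactly n
    elif not lo and not hi:
        return None                      # "{,}" is still a literal string
    else:
        minimum = int(lo) if lo else 0   # {,m} is read as {0,m}
        maximum = int(hi) if hi else None
    # Assumption of this sketch: report bad limits by raising ValueError.
    if minimum > MAX_REPEAT or (maximum is not None and maximum > MAX_REPEAT):
        raise ValueError("number too big in {} quantifier")
    if maximum is not None and minimum > maximum:
        raise ValueError("numbers out of order in {} quantifier")
    return minimum, maximum


if __name__ == "__main__":
    for sample in ["{,4}", "{2,4}", "{8}", "{3,}", "{,}", "{}"]:
        print(sample, parse_brace_quantifier(sample))
```

Returning `None` for the literal case keeps "not a quantifier" separate from "a quantifier with invalid limits", which the documentation also treats as different situations.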

doc/pcre2-config.txt

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
1+
12
PCRE2-CONFIG(1) General Commands Manual PCRE2-CONFIG(1)
23

34
