Qt wiki will be updated on October 12th 2023 starting at 11:30 AM (EEST) and the maintenance will last around 2-3 hours. During the maintenance the site will be unavailable.

Regexp engine in Qt5: Difference between revisions

From Qt Wiki
Jump to navigation Jump to search
No edit summary
(Removed the cleanup tag as the page is now correct)
 
(6 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{Cleanup | reason=Auto-imported from ExpressionEngine.}}
[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]]


[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]]
[toc align_right="yes" depth="3"]


= Regular expression engine in Qt5 =
= Regular expression engine in Qt5 =


This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.
This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.
* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
* http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001004.html
* https://bugreports.qt.nokia.com/browse/QTBUG-20888
* https://bugreports.qt.io/browse/QTBUG-20888


== Current issues with QRegExp ==
== Current issues with QRegExp ==


From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html
From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html


=== High level issues ===
=== High level issues ===
Line 19: Line 17:
* QRegExp is used for QtScript though it does not fullfill the ECMAScript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm ECMA-262-1999] . Missing features include
* QRegExp is used for QtScript though it does not fullfill the ECMAScript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm ECMA-262-1999] . Missing features include
** Non-greedy quantifiers (see page 141 titled "- 129 -")
** Non-greedy quantifiers (see page 141 titled "- 129 -")
'''''' But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
** But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
* Patternist/XPath also needs Regex features not found in QRegExp, including
* Patternist/XPath also needs Regex features not found in QRegExp, including
'''''' Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] …
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] …
* Customer complained about QRegExp (though I don't see what's their exact problem):
* Customer complained about QRegExp (though I don't see what's their exact problem):
Line 37: Line 35:
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
* '''T4''': lookbehind is not supported (lookahead is)
* '''T4''': lookbehind is not supported (lookahead is)
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".'''" returns 3, not 0
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
''' '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.


Line 68: Line 66:
! RE2
! RE2
|-
|-
! General comments|See above.
! General comments
| See above.
|  
|  
|  
|  
Line 183: Line 182:
! RE2
! RE2
|-
|-
!  
! \a
|BELL
|BELL
|✔
|✔
Line 193: Line 192:
|  
|  
|-
|-
!  
! \A
|beginning of input
|beginning of input
|  
|  
Line 203: Line 202:
|  
|  
|-
|-
! inside a [set]
! \b inside a [set]
|BACKSPACE
|BACKSPACE
|?
|?
Line 213: Line 212:
|  
|  
|-
|-
! outside a [set]
! \b outside a [set]
|on a word boundary
|on a word boundary
|✔
|✔
Line 223: Line 222:
|  
|  
|-
|-
!  
! \B
|not on a word boundary
|not on a word boundary
|✔
|✔
Line 233: Line 232:
|  
|  
|-
|-
!  
! \cX
|ASCII control character X
|ASCII control character X
|✘
|✘
Line 243: Line 242:
|  
|  
|-
|-
!  
! \d
|digit
|digit
|✔
|✔
Line 253: Line 252:
|  
|  
|-
|-
!  
! \D
|non digit
|non digit
|✔
|✔
Line 263: Line 262:
|  
|  
|-
|-
!  
! \e
|ESCAPE
|ESCAPE
|✘
|✘
Line 273: Line 272:
|  
|  
|-
|-
!  
! \E
|end of … quoting
|end of \Q \E quoting
|✘
|✘
|✔
|✔
Line 283: Line 282:
|  
|  
|-
|-
!  
! \f
|FORM FEED
|FORM FEED
|✔
|✔
Line 293: Line 292:
|  
|  
|-
|-
!  
! \G
|end of previous match
|end of previous match
|✘
|✘
Line 303: Line 302:
|  
|  
|-
|-
!  
! \n
|LINE FEED
|LINE FEED
|✔
|✔
Line 313: Line 312:
|  
|  
|-
|-
! {x}
! \N{x}
|UNICODE CHARACTER NAME x
|UNICODE CHARACTER NAME x
|✘
|✘
Line 323: Line 322:
|  
|  
|-
|-
! {x}
! \p{x}
|UNICODE PROPERTY NAME x
|UNICODE PROPERTY NAME x
|✘
|✘
Line 333: Line 332:
|  
|  
|-
|-
! {x}
! \P{x}
|UNICODE PROPERTY NAME not x
|UNICODE PROPERTY NAME not x
|✘
|✘
Line 343: Line 342:
|  
|  
|-
|-
!  
! \Q
|start of … quoting
|start of \Q \E quoting
|✘
|✘
|✔
|✔
Line 353: Line 352:
|  
|  
|-
|-
!  
! \r
|CARRIAGE RETURN
|CARRIAGE RETURN
|✔
|✔
Line 363: Line 362:
|  
|  
|-
|-
!  
! \s
|white space
|white space
|✔
|✔
Line 373: Line 372:
|  
|  
|-
|-
!  
! \S
|non white space
|non white space
|✔
|✔
Line 383: Line 382:
|  
|  
|-
|-
!  
! \t
|HORIZONTAL TAB
|HORIZONTAL TAB
|✔
|✔
Line 393: Line 392:
|  
|  
|-
|-
!  
! \uhhhh
|U+hhhh (between U+0000 and U+FFFF)
|U+hhhh (between U+0000 and U+FFFF)
|✘
|✘
Line 403: Line 402:
|  
|  
|-
|-
!  
! \Uhhhhhhhh
|U+hhhhhhhh (between U+00000000 and U+0010FFFF)
|U+hhhhhhhh (between U+00000000 and U+0010FFFF)
|✘
|✘
Line 413: Line 412:
|  
|  
|-
|-
! VERTICAL TAB
! \v
| VERTICAL TAB
|✔
|✔
|✔
|✔
Line 422: Line 422:
|  
|  
|-
|-
!  
! \w
|word character
|word character
|✔
|✔
Line 432: Line 432:
|  
|  
|-
|-
!  
! \W
|non word character
|non word character
|✔
|✔
Line 442: Line 442:
|  
|  
|-
|-
! {hhhh}
! \x{hhhh}
|U+hhhh
|U+hhhh
|✘
|✘
Line 452: Line 452:
|  
|  
|-
|-
!  
! \xhhhh
|U+hhhh
|U+hhhh
|✔ (0000-FFFF)
|✔ (0000-FFFF)
Line 462: Line 462:
|  
|  
|-
|-
!  
! \X
|grapheme cluster
|grapheme cluster
|✘
|✘
Line 472: Line 472:
|  
|  
|-
|-
!  
! \Z
|end of input (or before the final )
|end of input (or before the final )
|✘
|✘
Line 482: Line 482:
|  
|  
|-
|-
!  
! \z
|end of input
|end of input
|✘
|✘
Line 492: Line 492:
|  
|  
|-
|-
!  
! \n
|n-th backreference
|n-th backreference
|✔
|✔
Line 502: Line 502:
|  
|  
|-
|-
! ooo
! \0ooo
|ASCII/Latin-1 character 0ooo
|ASCII/Latin-1 character 0ooo
|✔
|✔
Line 542: Line 542:
|  
|  
|-
|-
! quote the following symbol
! \
|quote the following symbol
|✔
|✔
|✔
|✔
Line 562: Line 563:
|}
|}


=== Supported syntax: Operators ===


=== Supported syntax: Operators ===
{| class="wikitable"
|''. Operator |''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2|
! Operator  
|''. * |match 0 or more times|✔|✔| |✔|?|?|✔|
!
|''. + |match 1 or more times|✔|✔| |✔| | |✔|
! QRegExp  
|''. ? |match 0 or 1 times|✔|✔| |✔| | |✔|
! PCRE
|''. {n} |match n times|✔|✔| |✔| | |✔|
! V8
|''. {n,} |match n or more times|✔|✔| |✔| | |✔|
! ICU
|''. {n,m} |match between n and m times|✔|✔| |✔| | |✔|
! Boost.Regex
|''. *? |match 0 or more times, not greedy|✘|✔| |✔| | |✔|
! std::regex
|''. ''? |match 1 or more times, not greedy|✘|✔| |✔| | |✔|
! RE2
|''. ?? |match 0 or 1 times, not greedy|✘|✔| |✔| | |✔|
|-
|''. {n}? |match n times|✘|✔| |✔| | |✔|
! *  
|''. {n,}? |match n or more times, not greedy|✘|✔| |✔| | |✔|
|match 0 or more times
|''. {n,m}? |match between n and m times, not greedy|✘|✔| |✔| | |✔|
|✔
|''. *+ |match 0 or more times, possessive|✘|✔| |✔| | |✘|
|✔
|''.''+ |match 1 or more times, possessive|✘|✔| |✔| | |✘|
|  
|''. ?'' |match 0 or 1 times, possessive|✘|✔| |✔| | |✘|
|✔
|''. {n}+ |match n times|✘|✔| |✔| | |✘|
|?
|''. {n,}+ |match n or more times, possessive|✘|✔| |✔| | |✘|
|?
|''. {n,m}+ |match between n and m times, possessive|✘|✔| |✔| | |✘|
|✔
|''. ( … ) |capturing group|✔|✔| |✔| | |✔|
|-
|''. (?: … ) |group|✔|✔| |✔| | |✔|
! +  
|''. (?> … ) |atomic grouping|✘|✔| |✔| | |✘|
|match 1 or more times
|''. (?# … ) |comment|✘|✔| |✔| | |✘|
|✔
|''. (?= … ) |look-ahead assertion|✔|✔| |✔| | |✘|
|✔
|''. (?! … ) |negative look-ahead assertion|✔|✔| |✔| | |✘|
|  
|''. (?<= … ) |look-behind assertion|✘|✔| |✔| | |✘|
|✔
|''. (?<! … ) |negative look-behind assertion|✘|✔| |✔| | |✘|
|  
|''. (?flags: … ) |flags change|✘|✔| | | | |✔|
|  
|''. (?flags) |flags change|✘|✔| |✔| | |✔|
|✔
|''. (?P<name> …) |named capturing group|✘|✔| |✘| | |✔|
|-
|''. (?<name> …) |named capturing group|✘|✔| |✘| | |✘|
! ?  
|''. (?'name' …) |named capturing group|✘|✔| |✘| | |✘|
|match 0 or 1 times
|''. (?| …) |branch reset|✘|✔| |✘| | |✘|
|✔
|''. | | | | | | | | |
|✔
|''. | | | | | | | | |
|  
|✔
|  
|  
|✔
|-
! {n}  
|match n times
|✔
|✔
|  
|✔
|  
|  
|✔
|-
! {n,}  
|match n or more times
|✔
|✔
|  
|✔
|  
|  
|✔
|-
! {n,m}  
|match between n and m times
|✔
|✔
|  
|✔
|  
|  
|✔
|-
! *?  
|match 0 or more times, not greedy
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! ''?  
|match 1 or more times, not greedy
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! ??  
|match 0 or 1 times, not greedy
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! {n}?  
|match n times
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! {n,}?  
|match n or more times, not greedy
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! {n,m}?  
|match between n and m times, not greedy
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! *+  
|match 0 or more times, possessive
|✘
|✔
|  
|✔
|  
|  
|✘
|-
!''+  
|match 1 or more times, possessive
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! ?''  
|match 0 or 1 times, possessive
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! {n}+  
|match n times
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! {n,}+  
|match n or more times, possessive
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! {n,m}+  
|match between n and m times, possessive
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! ( … )  
|capturing group
|✔
|✔
|  
|✔
|  
|  
|✔
|-
! (?: … )  
|group
|✔
|✔
|  
|✔
|  
|  
|✔
|-
! (?> … )  
|atomic grouping
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! (?# … )  
|comment
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! (?= … )  
|look-ahead assertion
|✔
|✔
|
|✔
|  
|  
|✘
|-
! (?! … )  
|negative look-ahead assertion
|✔
|✔
|  
|✔
|  
|  
|✘
|-
! (?<= … )  
|look-behind assertion
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! (?<! … )  
|negative look-behind assertion
|✘
|✔
|  
|✔
|  
|  
|✘
|-
! (?flags: … )  
|flags change
|✘
|✔
|  
|  
|  
|  
|✔
|-
! (?flags)  
|flags change
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! (?P<name> …)  
|named capturing group
|✘
|✔
|  
|✘
|  
|  
|✔
|-
! (?<name> …)  
|named capturing group
|✘
|✔
|  
|✘
|  
|  
|✘
|-
! (?'name' …)  
|named capturing group
|✘
|✔
|  
|✘
|  
|  
|✘
|-
! (?| …)  
|branch reset
|✘
|✔
|  
|✘
|  
|  
|✘
|}




=== Supported syntax: flags ===
=== Supported syntax: flags ===
|''. Flag |''. |''. QRegExp |''. PCRE|''. V8|''. ICU|''. Boost.Regex|''. std::regex|''. RE2|
 
|''. /i |case insensitive|✔|✔| |✔| | |✔|
{| class="wikitable"
|''. /m |multi-line|✘|✔| |✔| | |✔|
! Flag  
|''. /s |dot matches anything|~<ref>It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.</ref> |✔| |✔| | |✔|
!
|''. /x |ignore whitespace and comments|✘|✔| |✔| | |✔|
! QRegExp  
|_. /U |minimal match|✔|✔| |✘| | |✔|
! PCRE
! V8
! ICU
! Boost.Regex
! std::regex
! RE2
|-
! /i  
|case insensitive
|✔
|✔
|  
|✔
|  
|  
|✔
|-
! /m  
|multi-line
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! /s  
|dot matches anything
|~<ref>It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.</ref>  
|✔
|  
|✔
|  
|  
|✔
|-
! /x  
|ignore whitespace and comments
|✘
|✔
|  
|✔
|  
|  
|✔
|-
! /U  
|minimal match
|✔
|✔
|  
|✘
|  
|  
|✔
|}


= Benchmarks =
= Benchmarks =

Latest revision as of 15:41, 22 May 2015


Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.

Current issues with QRegExp

From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html

High level issues

  • QRegExp API is broken (see T7 in Low Level)
  • QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
    • Non-greedy quantifiers (see page 141 titled "- 129 -")
    • But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
  • Patternist/XPath also needs Regex features not found in QRegExp, including
  • Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem
  • Customer complained about QRegExp (though I don't see what's their exact problem):
    • In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
  • Public task request:

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
  • T6: only linear input is supported, for a text editor like Kate this does not scale
  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.

Future

  • It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features than std::regex

Proposed libraries

Feature matrix

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
General comments See above.
Already being used in Qt? Yes Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. Yes (Qt 5) libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. No No No
Pros Widely used, de-facto standard implementation for Perl-like regexps. Uses UTF-16 natively. very fast, use a DFA
Cons Does not run on every platform supported by QtCore / QtBase[2]. Boost does not give guarantees about ABI compatibility. uses UTF-8, doesn't have the lookbehind neither lookahead
Fixes T1
Fixes T2
Fixes T3 ?
Fixes T4
Fixes T5 ? ?
Fixes T6 ✔ ("by hand", with partial matching) Maybe yes, see UText . ✔ see StringPiece
Fixes T7

✘✔


Supported syntax: Characters

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
\a BELL
\A beginning of input
\b inside a [set] BACKSPACE ?
\b outside a [set] on a word boundary
\B not on a word boundary
\cX ASCII control character X
\d digit
\D non digit
\e ESCAPE
\E end of \Q … \E quoting
\f FORM FEED
\G end of previous match
\n LINE FEED
\N{x} UNICODE CHARACTER NAME x
\p{x} UNICODE PROPERTY NAME x
\P{x} UNICODE PROPERTY NAME not x
\Q start of \Q … \E quoting
\r CARRIAGE RETURN
\s white space
\S non white space
\t HORIZONTAL TAB
\uhhhh U+hhhh (between U+0000 and U+FFFF)
\Uhhhhhhhh U+hhhhhhhh (between U+00000000 and U+0010FFFF)
\v VERTICAL TAB
\w word character
\W non word character
\x{hhhh} U+hhhh ✔ (0-10FFFF) ✔ (0-10FFFF)
\xhhhh U+hhhh ✔ (0000-FFFF) ✔ (00-FF) ✔ (00-FF)
\X grapheme cluster
\Z end of input (or before the final )
\z end of input
\n n-th backreference
\0ooo ASCII/Latin-1 character 0ooo
. any character but newlines
^ line beginning
$ line end
\ quote the following symbol
[pattern] set

Supported syntax: Operators

Operator QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
* match 0 or more times ? ?
+ match 1 or more times
? match 0 or 1 times
{n} match n times
{n,} match n or more times
{n,m} match between n and m times
*? match 0 or more times, not greedy
? match 1 or more times, not greedy
?? match 0 or 1 times, not greedy
{n}? match n times
{n,}? match n or more times, not greedy
{n,m}? match between n and m times, not greedy
*+ match 0 or more times, possessive
+ match 1 or more times, possessive
? match 0 or 1 times, possessive
{n}+ match n times
{n,}+ match n or more times, possessive
{n,m}+ match between n and m times, possessive
( … ) capturing group
(?: … ) group
(?> … ) atomic grouping
(?# … ) comment
(?= … ) look-ahead assertion
(?! … ) negative look-ahead assertion
(?<= … ) look-behind assertion
(?<! … ) negative look-behind assertion
(?flags: … ) flags change
(?flags) flags change
(?P<name> …) named capturing group
(?<name> …) named capturing group
(?'name' …) named capturing group
…) branch reset


Supported syntax: flags

Flag QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
/i case insensitive
/m multi-line
/s dot matches anything ~[1]
/x ignore whitespace and comments
/U minimal match

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.