Qt wiki will be updated on October 12th 2023 starting at 11:30 AM (EEST) and the maintenance will last around 2-3 hours. During the maintenance the site will be unavailable.

Regexp engine in Qt5: Difference between revisions

From Qt Wiki
Jump to navigation Jump to search
No edit summary
 
(Removed the cleanup tag as the page is now correct)
 
(14 intermediate revisions by 4 users not shown)
Line 1: Line 1:
=Regular expression engine in Qt5=
[[Category:Developing_Qt::Qt Planning::Qt Public Roadmap]]


This page summarizes the research for an alternative regular expression engine to be used in [[Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.


* http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001004.html
= Regular expression engine in Qt5 =
* https://bugreports.qt.nokia.com/browse/QTBUG-20888


==Current issues with QRegExp==
This page summarizes the research for an alternative regular expression engine to be used in [[Qt_5.0 | Qt 5.0]]. The discussion started on the qt5-feedback mailing list, cf.
* http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001004.html
* https://bugreports.qt.io/browse/QTBUG-20888


From http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-September/001054.html
== Current issues with QRegExp ==


===High level issues===
From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html


* QRegExp <span class="caps">API</span> is broken (see '''T7''' in Low Level)
=== High level issues ===
* QRegExp is used for QtScript though it does not fullfill the <span class="caps">ECMAS</span>cript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm <span class="caps">ECMA</span>-262-1999] ''[ecma-international.org]'' . Missing features include
 
** Non-greedy quantifiers (see page 141 titled - 129 -)
* QRegExp API is broken (see '''T7''' in Low Level)
** But current implementation of QtScript uses <span class="caps">JSC</span> which uses its own engine anyway, and only use QRegExp in its api as a container.
* QRegExp is used for QtScript though it does not fullfill the ECMAScript specification [http://www.ecma-international.org/publications/standards/Ecma-262.htm ECMA-262-1999] . Missing features include
** Non-greedy quantifiers (see page 141 titled "- 129 -")
** But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
* Patternist/XPath also needs Regex features not found in QRegExp, including
* Patternist/XPath also needs Regex features not found in QRegExp, including
** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
** Non-greedy quantifiers ( http://www.w3.org/TR/xpath-functions/#regex-syntax )
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] ''[bugzilla.gnome.org]''
* Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of '''T6''' described below. GtkSourceView has exactly [http://bugzilla.gnome.org/show_bug.cgi?id=134674#c1 that problem] …
* Customer complained about QRegExp (though I don’t see what’s their exact problem):
* Customer complained about QRegExp (though I don't see what's their exact problem):
** ''In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the <span class="caps">PCRE</span> (Perl Compatible Regular Expressions) library.''
** ''In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.''
* Public task request:
* Public task request:
** Lookbehind ('''T4''') ([http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry bug 217916] ''[trolltech.com]'')
** Lookbehind ('''T4''') ([http://trolltech.com/developer/task-tracker/index_html?id=217916&method=entry bug 217916])
** Support for <span class="caps">POSIX</span> syntax ([http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry bug 218604] ''[trolltech.com]'')
** Support for POSIX syntax ([http://trolltech.com/developer/task-tracker/index_html?id=218604&method=entry bug 218604])
** Removing const modifiers ('''T7''') ([http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry bug 219234] ''[trolltech.com]'', [http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry bug 209041] ''[trolltech.com]'')
** Removing const modifiers ('''T7''') ([http://trolltech.com/developer/task-tracker/index_html?id=219234&method=entry bug 219234], [http://trolltech.com/developer/task-tracker/index_html?id=209041&method=entry bug 209041])
** Non-greedy quantifiers ('''T3''') ([http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry bug 116127] ''[trolltech.com]'')
** Non-greedy quantifiers ('''T3''') ([http://trolltech.com/developer/task-tracker/index_html?id=116127&method=entry bug 116127])


===Low Level issues===
=== Low Level issues ===


* '''T1''': ^ (caret) and $ (dollar) cannot match at each newline
* '''T1''': ^ (caret) and $ (dollar) cannot match at each newline
Line 33: Line 35:
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
* '''T3''': lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
* '''T4''': lookbehind is not supported (lookahead is)
* '''T4''': lookbehind is not supported (lookahead is)
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn(“abcd”) for pattern .*returns 3, not 0
* '''T5''': lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
* '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T6''': only linear input is supported, for a text editor like Kate this does not scale
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.
* '''T7''': QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.


=Future=
= Future =


* It '''must''' be a solid 3rd party engine — don’t want to develop an in-house engine and maintain it.
* It '''must''' be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
* QRegExp likely to be moved into its own module in order to keep source compatibility.
* QRegExp likely to be moved into its own module in order to keep source compatibility.
* Addresses the above low-level issues (as much as possible).
* Addresses the above low-level issues (as much as possible).
* (Nice to have) At least the same syntax / features than std::regex
* (Nice to have) At least the same syntax / features than std::regex


==Proposed libraries==
== Proposed libraries ==


* [http://www.pcre.org/ <span class="caps">PCRE</span>] ''[pcre.org]''
* [http://www.pcre.org/ PCRE]
* “V8”
* "V8"
* [http://userguide.icu-project.org/strings/regexp <span class="caps">ICU</span>] ''[userguide.icu-project.org]''
* [http://userguide.icu-project.org/strings/regexp ICU]
* [http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html Boost.Regex] ''[boost.org]''
* [http://www.boost.org/doc/libs/1_47_0/libs/regex/doc/html/index.html Boost.Regex]
* std::regex (new in C++11)
* std::regex (new in C+''11)
* [http://code.google.com/p/re2/ RE2] ''[code.google.com]''
* [http://code.google.com/p/re2/ RE2]


==Feature matrix==
== Feature matrix ==
{| class="wikitable"
!
! QRegExp
! PCRE
! V8
! ICU
! Boost.Regex
! std::regex
! RE2
|-
! General comments
| See above.
|
|
|
|
|
|
|-
! Already being used in Qt?
| Yes
| Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed.
| Yes (Qt 5)
| libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale.
| No
| No
| No
|-
! Pros
|
|Widely used, de-facto standard implementation for Perl-like regexps.
|
|Uses UTF-16 natively.
|
|
|very fast, use a DFA
|-
! Cons
|
|Uses UTF-8, thus requiring string and indexes conversion from/to UTF-16 (QString). UTF-16 support is being developed.|Does not run on every platform supported by QtCore / QtBase[2].
|
|Boost does not give guarantees about ABI compatibility.
|
|uses UTF-8, doesn't have the lookbehind neither lookahead
|-
! Fixes T1
|
|✔
|
| ✔
|
|
|✔
|-
! Fixes T2
|
|✔
|
| ✔
|
|
|✔
|-
! Fixes T3
|
|✔
|
|✔
|
|
|?
|-
! Fixes T4
|
|✔
|
|✘
|
|
| ✘
|-
! Fixes T5
|
|?
|
|?
|
|
|✘
|-
! Fixes T6
|
|✔ ("by hand", with partial matching)
|
|Maybe yes, see [http://userguide.icu-project.org/strings/utext UText] .
|
|
|✔ see StringPiece
|-
! Fixes T7
|
|✔
|
|✔
|
|
| ✘
|}


✘✔
✘✔


===Supported syntax: Characters===


{| class="infotable line"
 
!
=== Supported syntax: Characters ===
 
{| class="wikitable"
!
!
! QRegExp
!
! <span class="caps">PCRE</span>
! QRegExp  
! PCRE
! V8
! V8
! <span class="caps">ICU</span>
! ICU
! Boost.Regex
! Boost.Regex
! std::regex
! std::regex
Line 71: Line 183:
|-
|-
! \a
! \a
| <span class="caps">BELL</span>
|BELL
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \A
! \A
| beginning of input
|beginning of input
|
|  
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \b inside a [set]
! \b inside a [set]
| <span class="caps">BACKSPACE</span>
|BACKSPACE
| ?
|?
| ✔
|✔
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! \b outside a [set]
! \b outside a [set]
| on a word boundary
|on a word boundary
| ✔
|✔
| ✔
|✔
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! \B
! \B
| not on a word boundary
|not on a word boundary
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|
|
|
|-
|-
! \cX
! \cX
| <span class="caps">ASCII</span> control character X
|ASCII control character X
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \d
! \d
| digit
|digit
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \D
! \D
| non digit
|non digit
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \e
! \e
| <span class="caps">ESCAPE</span>
|ESCAPE
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \E
! \E
| end of \Q … \E quoting
|end of \Q … \E quoting
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \f
! \f
| <span class="caps">FORM</span> <span class="caps">FEED</span>
|FORM FEED
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \G
! \G
| end of previous match
|end of previous match
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \n
! \n
| <span class="caps">LINE</span> <span class="caps">FEED</span>
|LINE FEED
| ✔
|✔
| ✔
|✔
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! \N{x}
! \N{x}
| <span class="caps">UNICODE</span> <span class="caps">CHARACTER</span> <span class="caps">NAME</span> x
|UNICODE CHARACTER NAME x
| ✘
|✘
| ✘
|✘
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \p{x}
! \p{x}
| <span class="caps">UNICODE</span> <span class="caps">PROPERTY</span> <span class="caps">NAME</span> x
|UNICODE PROPERTY NAME x
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \P{x}
! \P{x}
| <span class="caps">UNICODE</span> <span class="caps">PROPERTY</span> <span class="caps">NAME</span> not x
|UNICODE PROPERTY NAME not x
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \Q
! \Q
| start of \Q … \E quoting
|start of \Q … \E quoting
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \r
! \r
| <span class="caps">CARRIAGE</span> <span class="caps">RETURN</span>
|CARRIAGE RETURN
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \s
! \s
| white space
|white space
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \S
! \S
| non white space
|non white space
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \t
! \t
| <span class="caps">HORIZONTAL</span> <span class="caps">TAB</span>
|HORIZONTAL TAB
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \uhhhh
! \uhhhh
| U+hhhh (between U+0000 and U+FFFF)
|U+hhhh (between U+0000 and U+FFFF)
| ✘
|✘
| ✘
|✘
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! \Uhhhhhhhh
! \Uhhhhhhhh
| U+hhhhhhhh (between U+00000000 and U+0010FFFF)
|U+hhhhhhhh (between U+00000000 and U+0010FFFF)
| ✘
|✘
| ✘
|✘
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! \v
! \v
| <span class="caps">VERTICAL</span> <span class="caps">TAB</span>
| VERTICAL TAB
| ✔
|✔
| ✔
|✔
|
|  
| ✘
|✘
|
|  
|
|  
|
|  
|-
|-
! \w
! \w
| word character
|word character
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \W
! \W
| non word character
|non word character
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \x{hhhh}
! \x{hhhh}
| U+hhhh
|U+hhhh
| ✘
|✘
| ✔ (0-10FFFF)
|✔ (0-10FFFF)
|
|  
| ✔ (0-10FFFF)
|✔ (0-10FFFF)
|
|  
|
|  
|
|  
|-
|-
! \xhhhh
! \xhhhh
| U+hhhh
|U+hhhh
| ✔ (0000-<span class="caps">FFFF</span>)
|✔ (0000-FFFF)
| ✔ (00-FF)
|✔ (00-FF)
|
|  
| ✔ (00-FF)
|✔ (00-FF)
|
|  
|
|  
|
|  
|-
|-
! \X
! \X
| grapheme cluster
|grapheme cluster
| ✘
|✘
| ✘
|✘
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! \Z
! \Z
| end of input (or before the final \n)
|end of input (or before the final )
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \z
! \z
| end of input
|end of input
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \n
! \n
| n-th backreference
|n-th backreference
| ✔
|✔
| ✔
|✔
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! \0ooo
! \0ooo
| <span class="caps">ASCII</span>/Latin-1 character 0ooo
|ASCII/Latin-1 character 0ooo
| ✔
|✔
| ✔
|✔
|
|  
|
|  
|
|  
|
|  
|
|  
|-
|-
! .
! .
| any character but newlines
|any character but newlines
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! ^
! ^
| line beginning
|line beginning
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! $
! $
| line end
|line end
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! \
! \
| quote the following symbol
|quote the following symbol
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|-
|-
! [pattern]
! [pattern]
| set
|set
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|  
|}
|}


===Supported syntax: Operators===
=== Supported syntax: Operators ===


{| class="infotable line"
{| class="wikitable"
! Operator
! Operator  
!
!  
! QRegExp
! QRegExp  
! <span class="caps">PCRE</span>
! PCRE
! V8
! V8
! <span class="caps">ICU</span>
! ICU
! Boost.Regex
! Boost.Regex
! std::regex
! std::regex
! RE2
! RE2
|-
|-
! *
! *  
| match 0 or more times
|match 0 or more times
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
| ?
|?
| ?
|?
| ✔
|✔
|-
! +
|match 1 or more times
|✔
|✔
|
|✔
|
|
|✔
|-
! ?
|match 0 or 1 times
|✔
|✔
|
|✔
|
|
|✔
|-
! {n}
|match n times
|✔
|✔
|
|✔
|
|
|✔
|-
|-
! +
! {n,}
| match 1 or more times
|match n or more times
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! ?
! {n,m}
| match 0 or 1 times
|match between n and m times
| ✔
|✔
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! {n}
! *?
| match n times
|match 0 or more times, not greedy
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! {n,}
! ''?
| match n or more times
|match 1 or more times, not greedy
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! {n,m}
! ??
| match between n and m times
|match 0 or 1 times, not greedy
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! *?
! {n}?  
| match 0 or more times, not greedy
|match n times
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! <ins>? </ins>
! {n,}?  
| match 1 or more times, not greedy
|match n or more times, not greedy
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! ??
! {n,m}?  
| match 0 or 1 times, not greedy
|match between n and m times, not greedy
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✔
|✔
|-
|-
! {n}?
! *+
| match n times
|match 0 or more times, possessive
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|
|-
|-
! {n,}?
!''+
| match n or more times, not greedy
|match 1 or more times, possessive
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|
|-
|-
! {n,m}?
! ?''
| match between n and m times, not greedy
|match 0 or 1 times, possessive
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|
|-
|-
! *
! {n}+
| match 0 or more times, possessive
|match n times
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✘
|✘
|-
|-
! ++
! {n,}+  
| match 1 or more times, possessive
|match n or more times, possessive
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✘
|✘
|-
|-
! ?+
! {n,m}+  
| match 0 or 1 times, possessive
|match between n and m times, possessive
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✘
|✘
|-
|-
! {n}+
! ( … )
| match n times
|capturing group
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|
|-
|-
! {n,}+
! (?: … )
| match n or more times, possessive
|group
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|
|-
|-
! {n,m}+
! (?> … )
| match between n and m times, possessive
|atomic grouping
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✘
|✘
|-
|-
! ( … )
! (?# … )  
| capturing group
|comment
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|
|-
|-
! (?: … )
! (?= … )  
| group
|look-ahead assertion
| ✔
|✔
| ✔
|✔
|
| ✔
|
|
|
|✔
|
|
|  
|
|-
|-
! (?&gt; … )
! (?! … )  
| atomic grouping
|negative look-ahead assertion
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✘
|✘
|-
|-
! (?# … )
! (?<= … )  
| comment
|look-behind assertion
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✘
|✘
|-
|-
! (?= … )
! (?<! … )  
| look-ahead assertion
|negative look-behind assertion
|
|
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
| ✘
|✘
|-
|-
! (?! … )
! (?flags: … )  
| negative look-ahead assertion
|flags change
|
|
| ✔
|✔
|
|  
|
|  
|
|  
|
|  
|
|
|-
|-
! (?&lt;= … )
! (?flags)  
| look-behind assertion
|flags change
| ✘
|✘
| ✔
|✔
|
|  
| ✔
|✔
|
|  
|
|  
|
|
|-
|-
! (?&lt;! … )
! (?P<name> …)  
| negative look-behind assertion
|named capturing group
| ✘
|✘
| ✔
|✔
|
|  
|
|
|
|  
|
|  
|
|
|-
|-
! (?flags: … )
! (?<name> …)  
| flags change
|named capturing group
| ✘
|✘
| ✔
|✔
|
|  
|
|
|
|  
|
|  
|
|
|-
|-
! (?flags)
! (?'name' …)  
| flags change
|named capturing group
| ✘
|✘
| ✔
|✔
|
|  
|
|
|
|  
|
|  
|
|
|-
|-
! (?P&lt;name&gt; …)
! (?| …)  
| named capturing group
|branch reset
| ✘
|✘
| ✔
|✔
|
|  
| ✘
|✘
|
|  
|
|  
|
|
|}
 
 
=== Supported syntax: flags ===
 
{| class="wikitable"
! Flag
!
! QRegExp
! PCRE
! V8
! ICU
! Boost.Regex
! std::regex
! RE2
|-
|-
! (?&lt;name&gt; …)
! /i
| named capturing group
|case insensitive
|
|
| ✔
|✔
|
|  
|
|
|
|  
|
|  
|
|
|-
|-
! (?‘name’ …)
! /m
| named capturing group
|multi-line
| ✘
|✘
| ✔
|✔
|
|  
|
|
|
|  
|
|  
|
|
|-
|-
! (?| …)
! /s
| branch reset
|dot matches anything
|
|~<ref>It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.</ref>
| ✔
|✔
|
|  
|
|
|
|  
|
|  
|
|
|-
|-
!
! /x
|
|ignore whitespace and comments
|
|
|
|
|
|  
|
|
|
|  
|
|  
|
|
|-
|-
!
! /U
|
|minimal match
|
|
|
|
|
|  
|
|
|
|  
|
|  
|
|
|}
|}


===Supported syntax: flags===
= Benchmarks =
 
=Benchmarks=


See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.
See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.


<sup>1</sup> It’s actually not possible to <span class="caps">UNSET</span> /s for QRegExp, i.e. making the dot not to match a newline.
<references />
 
<sup>2</sup> Cf. http://lists.qt.nokia.com/pipermail/qt5-feedback/2011-August/000962.html
 
===Categories:===
 
* [[:Category:Developing Qt|Developing_Qt]]
** [[:Category:Developing Qt::Qt-Planning|Qt Planning]]
*** [[:Category:Developing Qt::Qt-Planning::Qt-Public-Roadmap|Qt Public Roadmap]]

Latest revision as of 15:41, 22 May 2015


Regular expression engine in Qt5

This page summarizes the research for an alternative regular expression engine to be used in Qt 5.0. The discussion started on the qt5-feedback mailing list, cf.

Current issues with QRegExp

From http://lists.qt-project.org/pipermail/qt5-feedback/2011-September/001054.html

High level issues

  • QRegExp API is broken (see T7 in Low Level)
  • QRegExp is used for QtScript though it does not fullfill the ECMAScript specification ECMA-262-1999 . Missing features include
    • Non-greedy quantifiers (see page 141 titled "- 129 -")
    • But current implementation of QtScript uses JSC which uses its own engine anyway, and only use QRegExp in its api as a container.
  • Patternist/XPath also needs Regex features not found in QRegExp, including
  • Qt Creator might want to offer multi-line Regex search- and replacing later. This cannot be efficient because of T6 described below. GtkSourceView has exactly that problem
  • Customer complained about QRegExp (though I don't see what's their exact problem):
    • In their code they have RegExp? for matching emoticons. Unfortunately, they cannot use QRegExp? because of poor support for negative/positive lookahead. As a workaround they are using the PCRE (Perl Compatible Regular Expressions) library.
  • Public task request:

Low Level issues

  • T1: ^ (caret) and $ (dollar) cannot match at each newline
  • T2: . (dot) always matches newlines
  • T3: lazy/non-greedy/reluctant quantifiers are not supported. this is not to be confused with minimal matching.
  • T4: lookbehind is not supported (lookahead is)
  • T5: lastIndexIn does not find that last match which indexIn would have found, e.g. lastIndexIn("abcd") for pattern ".*" returns 3, not 0
  • T6: only linear input is supported, for a text editor like Kate this does not scale
  • T7: QRegExp combines matcher and match object, despite the 1:n relation. As a consequence matching with a const QRegExp instance modifies a const object.

Future

  • It must be a solid 3rd party engine — don't want to develop an in-house engine and maintain it.
  • QRegExp likely to be moved into its own module in order to keep source compatibility.
  • Addresses the above low-level issues (as much as possible).
  • (Nice to have) At least the same syntax / features than std::regex

Proposed libraries

Feature matrix

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
General comments See above.
Already being used in Qt? Yes Indirectly as a GLIB dependency under Unix. Moreover, a stripped down version of PCRE is available inside WebKit (src/3rdparty/webkit/JavaScriptCore/pcre); all features not required by the JS specification were removed. Yes (Qt 5) libicui18n (optionally?) used by Qt 4.8 / Qt 5 in QLocale. No No No
Pros Widely used, de-facto standard implementation for Perl-like regexps. Uses UTF-16 natively. very fast, use a DFA
Cons Does not run on every platform supported by QtCore / QtBase[2]. Boost does not give guarantees about ABI compatibility. uses UTF-8, doesn't have the lookbehind neither lookahead
Fixes T1
Fixes T2
Fixes T3 ?
Fixes T4
Fixes T5 ? ?
Fixes T6 ✔ ("by hand", with partial matching) Maybe yes, see UText . ✔ see StringPiece
Fixes T7

✘✔


Supported syntax: Characters

QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
\a BELL
\A beginning of input
\b inside a [set] BACKSPACE ?
\b outside a [set] on a word boundary
\B not on a word boundary
\cX ASCII control character X
\d digit
\D non digit
\e ESCAPE
\E end of \Q … \E quoting
\f FORM FEED
\G end of previous match
\n LINE FEED
\N{x} UNICODE CHARACTER NAME x
\p{x} UNICODE PROPERTY NAME x
\P{x} UNICODE PROPERTY NAME not x
\Q start of \Q … \E quoting
\r CARRIAGE RETURN
\s white space
\S non white space
\t HORIZONTAL TAB
\uhhhh U+hhhh (between U+0000 and U+FFFF)
\Uhhhhhhhh U+hhhhhhhh (between U+00000000 and U+0010FFFF)
\v VERTICAL TAB
\w word character
\W non word character
\x{hhhh} U+hhhh ✔ (0-10FFFF) ✔ (0-10FFFF)
\xhhhh U+hhhh ✔ (0000-FFFF) ✔ (00-FF) ✔ (00-FF)
\X grapheme cluster
\Z end of input (or before the final )
\z end of input
\n n-th backreference
\0ooo ASCII/Latin-1 character 0ooo
. any character but newlines
^ line beginning
$ line end
\ quote the following symbol
[pattern] set

Supported syntax: Operators

Operator QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
* match 0 or more times ? ?
+ match 1 or more times
? match 0 or 1 times
{n} match n times
{n,} match n or more times
{n,m} match between n and m times
*? match 0 or more times, not greedy
? match 1 or more times, not greedy
?? match 0 or 1 times, not greedy
{n}? match n times
{n,}? match n or more times, not greedy
{n,m}? match between n and m times, not greedy
*+ match 0 or more times, possessive
+ match 1 or more times, possessive
? match 0 or 1 times, possessive
{n}+ match n times
{n,}+ match n or more times, possessive
{n,m}+ match between n and m times, possessive
( … ) capturing group
(?: … ) group
(?> … ) atomic grouping
(?# … ) comment
(?= … ) look-ahead assertion
(?! … ) negative look-ahead assertion
(?<= … ) look-behind assertion
(?<! … ) negative look-behind assertion
(?flags: … ) flags change
(?flags) flags change
(?P<name> …) named capturing group
(?<name> …) named capturing group
(?'name' …) named capturing group
…) branch reset


Supported syntax: flags

Flag QRegExp PCRE V8 ICU Boost.Regex std::regex RE2
/i case insensitive
/m multi-line
/s dot matches anything ~[1]
/x ignore whitespace and comments
/U minimal match

Benchmarks

See https://gitorious.org/qt-regexp-benchmarks/qt-regexp-benchmarks for the code and https://gitorious.org/qt-regexp-benchmarks/pages/Home for some results.

  1. It's actually not possible to UNSET /s for QRegExp, i.e. making the dot not to match a newline.