Regular expressions and 'capturing parentheses'

Question

We know that regular expressions (RE) are implemented with finite automata (FA). In some language (like JavaScript) in RE there are features like 'capturing parenthesis':

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-capturing-parentheses

(x) Matches 'x' and remembers the match, as the following example shows. The parentheses are called capturing parentheses. The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and remember the first two words in the string "foo bar foo bar". The \1 and \2 in the pattern match the string's last two words.

I want to know if this pattern /(foo) (bar) \1 \2/ is in fact a RE according the definition of RE that we have in theoretical formal language or it is something more powerful. And if it is so I would like to know if this kind of feature is implemented also with FA or in another way (in particualr how it is implemented).

See cs.stackexchange.com/q/4839/755 and cs.stackexchange.com/q/38451/755 and cs.stackexchange.com/q/24031/755. — D.W.♦, 9 hours ago
See swtch.com/~rsc/regexp/regexp1.html for a great writeup on the real-world (implementation level) performance implications of this difference. (Edit: I see it's already been linked to from this answer.) — Wildcard, 1 hour ago

Evil · Accepted Answer · 2017-01-23 21:32:07Z

up vote 2 down vote accepted

The RE in the Automata Theory are equivalent to FA, but for the programming languages (regexp) this is no longer true.

The regular expressions in the programming languages (like PCRE) are far more powerfull than Regular Expressions (type 3) in the Automata Theory.

The matching parenthesis is neither regular nor context-free, it is a context-sensitive feature. But the RegExp from the question does not fully support Type 2 or Type 1.

The bracket matching is not implemented via FA. In case of PCRE it is implemented by guessing and backtracking.

Take a look at Perl Monks description about PCRE.

edited 10 hours ago

answered 10 hours ago

Evil

5,12341540

Thank you. Then, in this case, RegExp is an abuse of (formal) language. – asv 10 hours ago

2

Well, it is the name collision. The very initial idea was something like RE, but even when it evolved, the name remained. – Evil 10 hours ago

@asv: For some time after capturing was introduced the RE with capture groups was called extended regexp or ERE. Then for some time after Perl introduced their version of RE it was called regex to differentiate it from POSIX standardised regexp and ERE (note regex vs regexp). These days people don't care. – slebetman 4 hours ago

add a comment |

David Richerby · Answer 2 · 2017-01-23 22:25:00Z

These extended notions of regular expressions capture more than just the regular languages. For example, ([ab]*)\1 matches the language $\{ww\mid w\in\{a,b\}^*\}$, which isn't regular and isn't even context-free (Example 2.38 of Sipser, Introduction to the Theory of Computation, 3rd edition).

"Regular" expressions that don't match regular langauges can't be translated to finite automata, since finite automata match only the regular languages. A side-effect of this is that many libraries don't even try to compile to automata, which can lead to extremely slow matching, even when a "regular" expression is a true regular expression. Russ Cox wrote an excellent article about this, which goes into a lot of the history, too.

Thank you for your example and for these informations. :) – asv 8 hours ago — asv, 8 hours ago

Mehrdad · Answer 3 · 2017-01-24 04:29:16Z

The answers are probably answering what you're intending to ask, but not what you're asking.

I want to know if this pattern /(foo) (bar) \1 \2/ is in fact a RE according the definition of RE that we have in theoretical formal language or it is something more powerful. And if it is so I would like to know if this kind of feature is implemented also with FA or in another way (in particualr how it is implemented).

In fact this is a regular expression that can be implemented with a finite automaton, because \1 is guaranteed to evaluate to foo and \2 is guaranteed to evaluate to bar.

Therefore a regex engine could use this fact to actually create a finite automaton that exactly describes the language you proposed.

However, if you make any captures conditional, then this may become false, as others have mentioned.

(Note that I say you may have trouble, because a language like /(a(aa|aa)|(aa|aa)a)\1\2/ can still be described via a FA. I only gave you a necessary condition, not a sufficient one.)

asked	today
viewed	212 times
active	today

current community

your communities

more stack exchange communities

Regular expressions and 'capturing parentheses'

3 Answers 3

Your Answer

Not the answer you're looking for? Browse other questions tagged finite-automata regular-expressions or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

Regular expressions and 'capturing parentheses'

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged finite-automata regular-expressions or ask your own question.

Linked

Related

Hot Network Questions