We know that regular expressions (RE) are implemented with finite automata (FA). In some language (like JavaScript) in RE there are features like 'capturing parenthesis':

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#special-capturing-parentheses

(x) Matches 'x' and remembers the match, as the following example shows. The parentheses are called capturing parentheses. The '(foo)' and '(bar)' in the pattern /(foo) (bar) \1 \2/ match and remember the first two words in the string "foo bar foo bar". The \1 and \2 in the pattern match the string's last two words.

I want to know if this pattern /(foo) (bar) \1 \2/ is in fact a RE according the definition of RE that we have in theoretical formal language or it is something more powerful. And if it is so I would like to know if this kind of feature is implemented also with FA or in another way (in particualr how it is implemented).

share|cite|improve this question
1  
    
See swtch.com/~rsc/regexp/regexp1.html for a great writeup on the real-world (implementation level) performance implications of this difference. (Edit: I see it's already been linked to from this answer.) – Wildcard 1 hour ago
up vote 2 down vote accepted

The RE in the Automata Theory are equivalent to FA, but for the programming languages (regexp) this is no longer true.

The regular expressions in the programming languages (like PCRE) are far more powerfull than Regular Expressions (type 3) in the Automata Theory.

The matching parenthesis is neither regular nor context-free, it is a context-sensitive feature. But the RegExp from the question does not fully support Type 2 or Type 1.

The bracket matching is not implemented via FA. In case of PCRE it is implemented by guessing and backtracking.

Take a look at Perl Monks description about PCRE.

share|cite|improve this answer
    
Thank you. Then, in this case, RegExp is an abuse of (formal) language. – asv 10 hours ago
2  
Well, it is the name collision. The very initial idea was something like RE, but even when it evolved, the name remained. – Evil 10 hours ago
    
@asv: For some time after capturing was introduced the RE with capture groups was called extended regexp or ERE. Then for some time after Perl introduced their version of RE it was called regex to differentiate it from POSIX standardised regexp and ERE (note regex vs regexp). These days people don't care. – slebetman 4 hours ago

These extended notions of regular expressions capture more than just the regular languages. For example, ([ab]*)\1 matches the language $\{ww\mid w\in\{a,b\}^*\}$, which isn't regular and isn't even context-free (Example 2.38 of Sipser, Introduction to the Theory of Computation, 3rd edition).

"Regular" expressions that don't match regular langauges can't be translated to finite automata, since finite automata match only the regular languages. A side-effect of this is that many libraries don't even try to compile to automata, which can lead to extremely slow matching, even when a "regular" expression is a true regular expression. Russ Cox wrote an excellent article about this, which goes into a lot of the history, too.

share|cite|improve this answer
    
Thank you for your example and for these informations. :) – asv 8 hours ago

The answers are probably answering what you're intending to ask, but not what you're asking.

I want to know if this pattern /(foo) (bar) \1 \2/ is in fact a RE according the definition of RE that we have in theoretical formal language or it is something more powerful. And if it is so I would like to know if this kind of feature is implemented also with FA or in another way (in particualr how it is implemented).

In fact this is a regular expression that can be implemented with a finite automaton, because \1 is guaranteed to evaluate to foo and \2 is guaranteed to evaluate to bar.

Therefore a regex engine could use this fact to actually create a finite automaton that exactly describes the language you proposed.

However, if you make any captures conditional, then this may become false, as others have mentioned.

(Note that I say you may have trouble, because a language like /(a(aa|aa)|(aa|aa)a)\1\2/ can still be described via a FA. I only gave you a necessary condition, not a sufficient one.)

share|cite|improve this answer

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.