
I have a bunch of regexes in a script. I would like to know how many capture groups are in them. More precisely, I'd like to know how many items would be added to the @- and @+ arrays if they matched, before actually using them in a real match operation.

An example:

'XXAB(CD)DE\FG\XX' =~ /(?i)x(ab)\(cd\)(?:de)\\(fg\\)x/
    and print "'@-', '@+'\n";

In this case the output is:

'1 2 11', '15 4 14'

So after matching I know that the 0th item is the matched part of the string, and there are two capture group expressions. Would it be possible to know right before the actual match?

I tried to concentrate on the opening brackets. First I removed the '\\' patterns to make it easier to detect escaped brackets. Then I removed the '\(' strings, and after that the '(?' strings. Now I can count the remaining opening brackets.

my $re = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x'; print "ORIG: '$re'\n";
'XXAB(CD)DE\FG\XX' =~ /$re/ and print "RE: '@-', '@+'\n";
$re =~ s/\\\\//g; print "\\\\: '$re'\n";
$re =~ s/\\\(//g; print "\\(: '$re'\n";
$re =~ s/\(\?//g; print "\\?: '$re'\n";
my $n = ($re =~ s/\(//g); print "n=$n\n";

Output:

ORIG: '(?i)x(ab)\(cd\)(?:de)\\(fg\\)x'
RE: '1 2 11', '15 4 14'
\\: '(?i)x(ab)\(cd\)(?:de)(fg)x'
\(: '(?i)x(ab)cd\)(?:de)(fg)x'
\?: 'i)x(ab)cd\):de)(fg)x'
n=2

So here I know that there are 2 capture groups in this regex. But maybe there is an easier way, and this is definitely not complete (e.g. it treats (?<foo>...) and (?'foo'...) as non-capture groups).

Another way would be to dump the internal data structures of regcomp function. Maybe the package Regexp::Debugger could solve the issue, but I have no right to install packages in my environment.

Actually the regexes are keys to some ARRAY refs, and I'd like to check that each referenced ARRAY contains the proper number of values before actually applying the regexes. Of course this check can be done right after the pattern match, but it would be nicer if I could do it in the loading stage of the script.
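A load-time check of this kind might be sketched as follows, using only core Perl. It relies on the documented behaviour that $#+ gives the number of groups in the last successful match; prefixing an empty alternative makes the match succeed against an empty string, so no suitable subject string is needed. The helper name count_groups and the %table contents are hypothetical, and the sketch assumes the patterns compile, contain no (?{ ... }) code blocks (those would need "use re 'eval'" to be interpolated), and need no extra modifiers such as /x.

```perl
use strict;
use warnings;

# Hypothetical helper: count the capture groups Perl sees in a pattern.
# The empty first alternative always matches, so the match succeeds
# without a real subject string; $#+ then holds the number of groups
# in the pattern of that successful match.
sub count_groups {
    my ($pattern) = @_;
    "" =~ /|(?:$pattern)/ or die "empty alternative failed to match";
    return $#+;
}

# Hypothetical table in the shape the question describes:
# pattern source => array ref with one value per capture group.
my %table = (
    '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x' => [ 'first', 'second' ],
    '(?<year>\d{4})-(\d\d)'              => [ 'only one' ],
);

for my $pattern (sort keys %table) {
    my $want = count_groups($pattern);
    my $have = @{ $table{$pattern} };
    print "$pattern: $want group(s), $have value(s)",
        ($want == $have ? "" : "  <-- mismatch"), "\n";
}
```

For the pattern from the example this reports 2 groups. If the real match uses modifiers such as /x, the counting match would presumably need the same modifiers.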

Thank you for your help and comments in advance!

    
Doesn't handle [^()] – ikegami Jan 19 at 14:51
    
Doesn't handle # () (when /x is used) – ikegami Jan 19 at 14:51
    
Doesn't handle (?{ () }) and similar. – ikegami Jan 19 at 14:52
    
Re "but I have no right to install packages in my environment", You need no special permissions to install modules. – ikegami Jan 19 at 14:52
If you won't install modules, I don't see what you want from us. – ikegami Jan 19 at 14:55

Regex:

\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))

Explanation:

\\.                     # Match any escaped character
(*SKIP)(?!)             # Discard it
|                       # OR
\(                      # Match a single `(`
(?(?=\?)                # Which if is followed by `?`
    \?                      # Match `?`
    P?['<]\w+['>]           # Next characters should be matched as ?P'name', ?<name> or ?'name'
)                       # End of conditional statement

Perl:

my @offsets = ();
while ('XXAB(CD)DE\FG\X(X)' =~ /\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/g){
    push @offsets, "$-[0]";
}
print join(", ", @offsets);

Output:

4, 15

This indicates that there are two capturing groups in the input string.
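The demonstration above runs the regex against a subject string. To count groups in a pattern, it would be applied to the pattern's source text instead. A minimal sketch, assuming the pattern is available as a plain string (the helper name count_open_parens is hypothetical):

```perl
use strict;
use warnings;

# The regex from this answer, unchanged: skip escaped characters, then
# match "(" unless it opens a "(?" construct other than a named group.
my $open_paren = qr/\\.(*SKIP)(?!)|\((?(?=\?)\?(P?['<]\w+['>]))/;

# Hypothetical helper: count matches of that regex in a pattern's source.
sub count_open_parens {
    my ($source) = @_;
    my $count = 0;
    $count++ while $source =~ /$open_paren/g;
    return $count;
}

# The pattern from the question, held as a plain string.
my $source = '(?i)x(ab)\(cd\)(?:de)\\\\(fg\\\\)x';
print count_open_parens($source), "\n";   # prints 2
```

This is still text-based counting, so it shares the limits raised in the comments on the question: a pattern such as '[(]' is counted as one group even though Perl sees none.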

    
Thanks for your answer! As far as I can see, you just repackaged my attempt. It's a nice regex and I can learn a lot from it, but unfortunately it is still far from complete. – TrueY Jan 20 at 12:24

Without any limiting requirements on the regexes that can occur, I think there is no definitive answer to the number of capture groups. Just think of alternatives with differing capture group counts, and the possibility of this occurring again in each branch:

my $re = qr/ A(B)C | A(D|(E(G+|H))F) /x;

This regex can obviously have up to 3 capture groups. You could recursively parse each branch, and take the highest number as your result - but I honestly cannot come up with a practical way to do this in a short time. For 'linear' regexes not using alternatives or non-basic regex features, the task of determining the count of capture groups is possible, but I don't think it's feasible with more advanced ones.
