Unix & Linux Stack Exchange is a question and answer site for users of Linux, FreeBSD and other Un*x-like operating systems. It's 100% free, no registration required.

I have to parse huge text files where certain lines are of interest and others are not. Within those of interest I have to count the occurrences of a certain keyword.

Assume the file is called input.txt and looks like this:

format300,format250,format300
format250,ignore,format160,format300,format300
format250,format250,format300

I want to exclude the lines containing ignore and count the occurrences of format300 in the remaining lines. How do I do that?

What I've got so far is this command which only counts ONCE PER LINE (which is not yet good enough):

cat input.txt | grep -v ignore | grep 'format300' | wc -l

Any suggestions? If possible I want to avoid using perl.


This one-liner should be able to do what you want:

grep -v ignore input.txt | sed 's/format300/format300\n/g' | grep -c "format300"

Basically, you are replacing each occurrence of your keyword with the keyword itself followed by a newline character, so that the keyword appears at most once on any given line of the stream. Then grep -c counts the lines that contain your keyword. Note that writing the newline as \n in the replacement assumes GNU sed; other sed implementations may treat it differently.
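To see what the sed stage does to the stream, here is a minimal sketch on the sample from the question. The temporary file is only there to make the snippet self-contained, and the substitution uses a backslash followed by a literal newline, which is the spelling POSIX sed defines, instead of \n; that swap is my assumption, not part of the answer above.

```shell
# Sample data from the question, written to a throwaway file.
tmp=$(mktemp)
printf '%s\n' \
  'format300,format250,format300' \
  'format250,ignore,format160,format300,format300' \
  'format250,format250,format300' > "$tmp"

# Break the line after every keyword: "&" stands for the matched text,
# and the backslash-newline inserts a real newline in the replacement.
split=$(grep -v ignore "$tmp" | sed 's/format300/&\
/g')
printf '%s\n' "$split"

# Each keyword now sits on its own line, so counting lines counts keywords.
count=$(printf '%s\n' "$split" | grep -c format300)
echo "$count"   # prints 3

rm -f "$tmp"
```

The intermediate output also contains empty lines and leftover fragments such as ",format250,format300"; that is harmless here because grep -c only counts the lines that still contain the keyword.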

I would replace the sed with tr ',' '\n'; otherwise you're going to count format3000s as well, not just format300s – 1_CR 12 hours ago
@1_CR, agreed, but you would also have to use grep -xc format300 instead of grep -c format300 to avoid false positives on "format3000". So the full solution is grep -v ignore input.txt | tr , '\n' | grep -xc format300 – Wildcard 9 hours ago
@Wildcard, indeed – 1_CR 9 hours ago
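A small sketch of the point made in these comments. The third sample line is altered to contain a format3000 field, which is not in the original question; it is there only to trigger the false positive.

```shell
# Question's sample with a hypothetical format3000 field added on line 3.
tmp=$(mktemp)
printf '%s\n' \
  'format300,format250,format300' \
  'format250,ignore,format160,format300,format300' \
  'format250,format3000,format300' > "$tmp"

# One field per line, then a substring match: format3000 is counted too.
substring_count=$(grep -v ignore "$tmp" | tr ',' '\n' | grep -c format300)

# -x only accepts lines that match the pattern in full.
exact_count=$(grep -v ignore "$tmp" | tr ',' '\n' | grep -xc format300)

echo "$substring_count"   # prints 4
echo "$exact_count"       # prints 3

rm -f "$tmp"
```

This relies on the fields being separated by plain commas with no surrounding whitespace, as in the question's sample.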

You don't need the first cat; that is known as a useless use of cat (UUOC).

Also very useful is grep -o, which outputs only the matching parts of each line, one per output line.

Then count the lines with wc -l.

grep -v ignore YOUR_FILE | grep -o format300 | wc -l

This prints 3 for your small sample.
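For anyone who wants to see the difference between counting lines and counting matches, here is a quick sketch on the question's sample. It assumes a grep that supports -o (GNU and BSD grep do, but it is not required by POSIX); the temporary file just makes the snippet self-contained.

```shell
# Sample data from the question.
tmp=$(mktemp)
printf '%s\n' \
  'format300,format250,format300' \
  'format250,ignore,format160,format300,format300' \
  'format250,format250,format300' > "$tmp"

# grep -c counts matching LINES, so two hits on one line count once.
per_line=$(grep -v ignore "$tmp" | grep -c format300)

# grep -o prints every match on its own line, so wc -l counts MATCHES.
per_match=$(grep -v ignore "$tmp" | grep -o format300 | wc -l)

echo "$per_line"    # prints 2
echo "$per_match"   # prints 3

rm -f "$tmp"
```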


A Perl way:

perl -lne '$k+=(s/format300//g) unless /ignore/; }{ print $k' input.txt 

The s/format300//g replaces all occurrences of format300 with nothing and returns the number of replacements, which makes it a simple way of counting the occurrences. That number is added to $k, and the whole thing only happens if the line doesn't match ignore. The }{ is a Perl trick for "do this after you've finished reading the file": it closes the implicit loop that -n wraps around the code, so print $k prints the total number found.
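Since the question asks to avoid perl, the same counting-by-substitution idea can be sketched with awk, whose gsub also returns the number of replacements it made. This is an equivalent I am adding for comparison, not something from the answer above.

```shell
# Sample data from the question.
tmp=$(mktemp)
printf '%s\n' \
  'format300,format250,format300' \
  'format250,ignore,format160,format300,format300' \
  'format250,format250,format300' > "$tmp"

# On lines that do not match "ignore", gsub() removes every format300
# and returns how many it removed; the END block prints the running total.
# "k + 0" forces a numeric 0 if nothing matched at all.
total=$(awk '!/ignore/ { k += gsub(/format300/, "") } END { print k + 0 }' "$tmp")

echo "$total"   # prints 3

rm -f "$tmp"
```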


The input file may contain partial matches that would invalidate the result, for example:

1 format300,format250,format300
2 format250,ignore,format160,format300,format300
3 format250,format250,format300
4 format999,format300000,format999
5 format999,ignore_me_not,format300

You don't want to count format300000 on line 4, and you don't want to exclude line 5 just because ignore_me_not contains the substring ignore.

This would do the trick:

grep -v "\bignore\b" FILE | grep -o "\bformat300\b" | wc -l

The correct output is

4

...because line 2 is ignored, line 5 is not, and line 4 doesn't contain exactly format300.
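Here is a sketch that reproduces this on the five sample lines. It assumes the leading 1 to 5 in the listing are only line labels and not part of the file, and that grep understands \b as a word boundary, which is a GNU extension.

```shell
# The five sample lines, without the line labels.
tmp=$(mktemp)
printf '%s\n' \
  'format300,format250,format300' \
  'format250,ignore,format160,format300,format300' \
  'format250,format250,format300' \
  'format999,format300000,format999' \
  'format999,ignore_me_not,format300' > "$tmp"

# How many lines survive each kind of filter (-vc counts non-matching lines).
kept_plain=$(grep -vc 'ignore' "$tmp")        # drops lines 2 and 5
kept_word=$(grep -vc '\bignore\b' "$tmp")     # drops only line 2

# Word-boundary pipeline from the answer, and the plain substring pipeline.
count=$(grep -v '\bignore\b' "$tmp" | grep -o '\bformat300\b' | wc -l)
naive=$(grep -v 'ignore' "$tmp" | grep -o 'format300' | wc -l)

echo "$kept_plain $kept_word"   # prints 3 4
echo "$count"                   # prints 4
echo "$naive"                   # prints 4

rm -f "$tmp"
```

On this particular sample the plain substring pipeline also prints 4, but only because its two mistakes cancel out: it wrongly drops line 5 and wrongly counts the format300 inside format300000. The kept-line counts show that the two pipelines are not looking at the same lines.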

If you leave the wc -l part out, you can see exactly what is being matched:

[image: terminal output of the command without wc -l]

Clever, but it goes beyond his actual stated requirements: "I want to exclude the lines with ignore..." Your point about format3000 et al. is perfectly valid, though. (As I commented on another answer.) – Wildcard 9 hours ago
This is the only solution that is considering the false positives, the others so far will all fail with this input. There is still a problem, though, because the word boundaries aren't quite what are needed. A character like - is not a word character, so ignore-me-not will match \bignore\b but shouldn't. – leftclickben 9 hours ago
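One way to sidestep the word-boundary problem raised in this comment is to compare whole comma-separated fields instead of matching patterns. The following is only a sketch of that idea, assuming fields are separated by plain commas with no surrounding whitespace; the ignore-me-not line is a hypothetical test case taken from the comment, not from the original sample.

```shell
# Sample with a hyphenated field that \bignore\b would wrongly match.
tmp=$(mktemp)
printf '%s\n' \
  'format300,format250,format300' \
  'format250,ignore,format160,format300,format300' \
  'format250,format250,format300' \
  'format999,format300000,format999' \
  'format999,ignore-me-not,format300' > "$tmp"

# Word-boundary version: "-" is not a word character, so line 5 is dropped.
boundary_count=$(grep -v '\bignore\b' "$tmp" | grep -o '\bformat300\b' | wc -l)

# Field-based version: a line is skipped only if some field is exactly
# "ignore", and only fields that are exactly "format300" are counted.
field_count=$(awk -F, '
  {
    skip = 0; n = 0
    for (i = 1; i <= NF; i++) {
      if ($i == "ignore")    skip = 1
      if ($i == "format300") n++
    }
    if (!skip) total += n
  }
  END { print total + 0 }
' "$tmp")

echo "$boundary_count"   # prints 3
echo "$field_count"      # prints 4

rm -f "$tmp"
```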
