Sunday, July 1, 2007

Extracting part of a field in awk

I had a file whose lines looked like this:

all_initstring0010010100111000/1dca.rule118.iter100.score all_initstring0010010100111000/1dca.rule140.iter100.score: 0
all_initstring0010010100111000/1dca.rule128.iter100.score all_initstring0010010100111000/1dca.rule122.iter100.score: 122312
all_initstring0010010100111000/1dca.rule113.iter100.score all_initstring0010010100111000/1dca.rule143.iter100.score: 3213


I wanted to extract the number that comes after "rule" and before the next "." in the 1st and 2nd fields, and also print the third field. I used awk's gsub substitution function with regular expressions to strip away everything except the required value. Here's the code:

gawk '{gsub(/^.*rule/,"",$1); gsub(/[^0-9].*/,"",$1); gsub(/^.*rule/,"",$2); gsub(/[^0-9].*/,"",$2); print $1 " " $2 " " $3}' myfile
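To see what the two substitutions do, here is a small trace on the first field of one sample line. It is only a sketch: the variable names (line, steps, f) are made up for illustration, and it uses sub rather than gsub on the assumption that each pattern can match at most once per field (the first is anchored at the start, the second runs to the end of the string).

```shell
# One sample line copied from the file shown above.
line='all_initstring0010010100111000/1dca.rule118.iter100.score all_initstring0010010100111000/1dca.rule140.iter100.score: 0'

steps=$(printf '%s\n' "$line" | awk '{
    f = $1                      # work on a copy of the first field
    sub(/^.*rule/, "", f)       # drop everything up to and including "rule"
    print "after first sub:  " f
    sub(/[^0-9].*/, "", f)      # drop from the first non-digit to the end
    print "after second sub: " f
}')

printf '%s\n' "$steps"
# after first sub:  118.iter100.score
# after second sub: 118
```

The same two steps applied to the second field give 140, since the trailing ".iter100.score:" is removed by the second substitution.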


There is actually another solution which is probably nicer:

awk '{ split($1,a1,/\./) ; split($2,a2,/\./); print substr(a1[2],5), substr(a2[2],5), $NF; }' myfile 
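
The split version relies on where the dots fall in each field, so it may help to look at the pieces it produces. The snippet below is a sketch on a single field taken from the sample data; the names field, pieces and a are illustrative only.

```shell
# First field of one sample line from the file shown above.
field='all_initstring0010010100111000/1dca.rule118.iter100.score'

pieces=$(printf '%s\n' "$field" | awk '{
    n = split($1, a, /\./)                  # split the field on literal dots
    for (i = 1; i <= n; i++)
        print i ": " a[i]                   # show each piece with its index
    print "rule number: " substr(a[2], 5)   # skip the 4 characters of "rule"
}')

printf '%s\n' "$pieces"
# 1: all_initstring0010010100111000/1dca
# 2: rule118
# 3: iter100
# 4: score
# rule number: 118
```

This works as long as there is no other dot earlier in the field and the second piece always starts with the four characters "rule"; if the directory name could contain a dot, the gsub version would be the safer choice.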

2 comments:

Niall Haslam said...

how would grep -o fare in this example?

new said...

You could use a regular expression that matches rule followed by one or more digits and just output that, but there are a few problems I don't know how you'd get around with grep. The biggest is that grep stops after it finds the first match on a line (is there a way round this?), and I have two ruleXXXs per line, both of which I want printed. The second is that I also want the final number on the line printed. Another problem is that grep would print ruleXXX rather than just the number; you could probably run it through grep again to strip off rule by matching only the digits.
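
On the first-match question: as far as I know, the -o option in GNU and BSD grep (it is an extension, not part of POSIX grep) prints every non-overlapping match on a line, each on its own output line. A rough sketch of the two-pass idea, with paste added to put the pair back on one line, might look like this; the variable names are illustrative.

```shell
# One sample line copied from the file shown above.
line='all_initstring0010010100111000/1dca.rule118.iter100.score all_initstring0010010100111000/1dca.rule140.iter100.score: 0'

nums=$(printf '%s\n' "$line" |
    grep -oE 'rule[0-9]+' |     # one match per output line: rule118, rule140
    grep -oE '[0-9]+' |         # second pass keeps only the digits
    paste -d' ' - -)            # join each pair of lines with a space

printf '%s\n' "$nums"
# 118 140
```

This still does not recover the final number on the line, so for this file the awk versions remain the simpler route.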