
I have 2 problems with a script:

  1. Passing the correct variable into awk
  2. Awk doesn't like the specific command used to specify the beginning pattern and the ending pattern to print between.

Here is the content of states.txt:

Alabama

Area: 52,423 sq.mi (135,775 sq.km.), 30th
Land: 50,750 sq.mi. (131,442 sq.km.), 28th
Water: 1,673 sq.mi. (4,333 sq.km.), 23rd
Coastline: 53 mi. (85 km.), 17th
Shoreline: 607 mi. (977 km.), 19th

Alaska

Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
Land: 570,374 sq.mi. (1,477,263 sq.km.), 1st
Water: 86,051 sq.mi. (222,871 sq.km.), 1st
Coastline: 6,640 mi. (10,686 km.), 1st
Shoreline: 33,904 mi. (54,563 km.), 1st

Arizona

Area: 114,006 sq.mi (295,274 sq.km.), 6th
Land: 113,642 sq.mi. (294,332 sq.km.), 6th
Water: 364 sq.mi. (943 sq.km.), 48th

Arkansas

Area: 53,182 sq.mi (137,741 sq.km.), 29th
Land: 52,075 sq.mi. (134,874 sq.km.), 27th
Water: 1,107 sq.mi. (2,867 sq.km.), 31st

California

Area: 163,707 sq.mi (423,999 sq.km.), 3rd
Land: 155,973 sq.mi. (403,969 sq.km.), 3rd
Water: 7,734 sq.mi. (20,031 sq.km.), 6th
Coastline: 840 mi. (1,352 km.), 3rd
Shoreline: 3,427 mi. (5,515 km.), 5th

Colorado

Area: 104,100 sq.mi (269,618 sq.km.), 8th
Land: 103,730 sq.mi. (268,660 sq.km.), 8th
Water: 371 sq.mi. (961 sq.km.), 46th

And so on and so forth.

What I am trying to do is develop a script that pulls the information for each state individually while parsing it.

So the script looks something like this:

for state in $(cat states.txt | egrep -v 'Area|Land|Water' | grep [A-Z]) ; do 

echo $state >> ./statelist.txt ; 

done ;

for statesnip in $(cat ./statelist.txt | awk 'NR>1{print p "_" $0 ORS} {p=$0}' | grep [A-Z]) ; do 

    state1=$(echo $statesnip | awk -F _ '{print $1}') ; 
    state2=$(echo $statesnip | awk -F _ '{print $2}') ; 

    cat ./states.txt | awk '/$state1/{f=1}; /$state2/{f=0}' >> $state1.tmp.txt ; 

done;

rm -f ./statelist.txt

So here is what is breaking:

The first is passing the variables into awk:

as in

awk -v state1=$state1 -v state2=$state2 '/state1/{f=1} f; /state2/{f=0}';

or

awk -v state1=${state1} state2=${state2} '/state1/{f=1} f; /state2/{f=0}';

I get an error

The second is that awk doesn't like it when I use the variables passed with -v as bare patterns (it just prints the entire file, numerous times).

 awk -v state1=${state1} -v state2=${state2} 'state1{f=1} f; state2{f=0}'

I just get a full cat of the entire file repeatedly.

The expected output should look like this:

cat ./statelist.txt

Alabama
Alaska
Arizona
Arkansas
California
Colorado

cat ./statelist.txt | awk 'NR>1{print p "_" $0 ORS} {p=$0}' | grep [A-Z]

Alabama_Alaska
Alaska_Arizona
Arizona_Arkansas
Arkansas_California
California_Colorado

cat ./Alabama.txt:

Alabama

Area: 52,423 sq.mi (135,775 sq.km.), 30th
Land: 50,750 sq.mi. (131,442 sq.km.), 28th
Water: 1,673 sq.mi. (4,333 sq.km.), 23rd
Coastline: 53 mi. (85 km.), 17th
Shoreline: 607 mi. (977 km.), 19th

cat ./Alaska.txt

Alaska

Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
Land: 570,374 sq.mi. (1,477,263 sq.km.), 1st
Water: 86,051 sq.mi. (222,871 sq.km.), 1st
Coastline: 6,640 mi. (10,686 km.), 1st
Shoreline: 33,904 mi. (54,563 km.), 1st

cat ./Arizona.txt

Arizona

Area: 114,006 sq.mi (295,274 sq.km.), 6th
Land: 113,642 sq.mi. (294,332 sq.km.), 6th
Water: 364 sq.mi. (943 sq.km.), 48th

cat ./Arkansas.txt

Arkansas

Area: 53,182 sq.mi (137,741 sq.km.), 29th
Land: 52,075 sq.mi. (134,874 sq.km.), 27th
Water: 1,107 sq.mi. (2,867 sq.km.), 31st

cat ./California.txt

California

Area: 163,707 sq.mi (423,999 sq.km.), 3rd
Land: 155,973 sq.mi. (403,969 sq.km.), 3rd
Water: 7,734 sq.mi. (20,031 sq.km.), 6th
Coastline: 840 mi. (1,352 km.), 3rd
Shoreline: 3,427 mi. (5,515 km.), 5th

cat ./Colorado.txt

Colorado

Area: 104,100 sq.mi (269,618 sq.km.), 8th
Land: 103,730 sq.mi. (268,660 sq.km.), 8th
Water: 371 sq.mi. (961 sq.km.), 46th
  • 1
    Are you just trying to split that original input file into files named by state that contain the data between the state name and the next state name? Commented Apr 28, 2015 at 18:41
  • 2
    cat foo.txt | awk '{ ... }' is better written as awk '{ ... }' foo.txt. Commented Apr 28, 2015 at 19:46

2 Answers

5

Any time you write a loop in shell just to manipulate text you have the wrong approach.

In this case, it LOOKS like all you really need for the whole thing is:

awk 'NF==1{out=$1".txt"} {print > out}' states.txt

If that's not it, please clarify. Oh, and with non-gawk you might need to add close(out) right before out=....
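A minimal sketch of the close(out) variant mentioned above, run against a small sample in the same layout as the question's states.txt. The temporary working directory and the shortened sample data are assumptions made only so the example is self-contained.

```shell
# Work in a scratch directory so no existing files are overwritten (assumption).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Shortened sample in the same layout as states.txt from the question.
cat > states.txt <<'EOF'
Alabama

Area: 52,423 sq.mi (135,775 sq.km.), 30th
Land: 50,750 sq.mi. (131,442 sq.km.), 28th

Alaska

Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
EOF

# A one-field line is taken to be a state name: close the previous output
# file (if any) and switch to <state>.txt. Every line goes to the current file.
awk 'NF==1 { if (out != "") close(out); out = $1 ".txt" } { print > out }' states.txt
```

This should leave Alabama.txt and Alaska.txt next to states.txt. Note that the NF==1 test assumes single-word state names, so a name such as "New York" would need a different test.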


4 Comments

Just as a quick point of observation, the following happened when I used your script: cat states.txt | grep [a-z] | awk 'NF==1{out=$1".txt"} {print > out}' failed with: awk: (FILENAME=- FNR=1) fatal: expression for `>' redirection has null string value. After much googling, I found that the reason this happens is because awk doesn't like opening a whole bunch of files at the same time. This being the case, I had to adjust the output to: cat states.txt | grep [a-z] | egrep -v configure | awk 'NF==1{out=$1} {print > out".txt"}'
@AndyD'Arata wrt "awk doesn't like opening a whole bunch of files at the same time": that is absolutely nonsense, so wherever you found that info through google make sure you never look there again. awk is MADE to handle multiple files. That error message simply means your file starts with 1 or more blank lines, so the first time the print is hit the output file name out is still null. You can fix it in one of many ways, including by just changing {print > out} to out{print > out}. Please understand: you NEVER need a chain of pipes with cats, greps, etc. when you are using awk.
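A small sketch of the out{print > out} guard described in this comment, using input that deliberately starts with a blank line. The scratch directory and the shortened sample are assumptions for illustration.

```shell
# Work in a scratch directory (assumption) so the example is self-contained.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample input beginning with a blank line, the case that leaves out empty
# when the first print is reached.
printf '\nAlabama\n\nArea: 52,423 sq.mi (135,775 sq.km.), 30th\n\nAlaska\n\nArea: 656,425 sq.mi (1,700,134 sq.km.), 1st\n' > states.txt

# The bare pattern "out" is false while out is still the empty string,
# so lines before the first state name are skipped rather than printed.
awk 'NF==1 { out = $1 ".txt" } out { print > out }' states.txt
```

With the guard in place the leading blank line is dropped, and no redirection is attempted with an empty file name.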
You didn't show us anything in your input file that would make the grep and egrep necessary. They are trivially made redundant in awk with a leading !/[a-z]/ || /configure/{next} but it's not clear you need them at all, and that may not have been the way I'd have done it off the bat. You almost certainly meant [[:lower:]] instead of [a-z] though, since in some locales [a-z] can include A, B, C, etc. Your "fix" btw would have created a hidden file in your directory named .txt, and it invokes undefined behavior with the unparenthesized expression on the right side of output redirection.
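A hedged sketch of folding the grep and egrep filters into awk as this comment suggests, with the redirection target parenthesized. The "configure" line in the sample is a made-up noise line, since the real input was never shown, and the scratch directory is likewise an assumption.

```shell
# Work in a scratch directory (assumption) so the example is self-contained.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Sample input; the "configure" line is hypothetical noise to be filtered out.
cat > states.txt <<'EOF'
Alabama

Area: 52,423 sq.mi (135,775 sq.km.), 30th
configure: hypothetical noise line
Land: 50,750 sq.mi. (131,442 sq.km.), 28th

Alaska

Area: 656,425 sq.mi (1,700,134 sq.km.), 1st
EOF

awk '
    # Replaces grep [a-z] and egrep -v configure: skip lines with no
    # lowercase letter (including blank lines) and lines mentioning configure.
    !/[[:lower:]]/ || /configure/ { next }
    # A one-field line is taken to be a state name.
    NF == 1 { out = $1 }
    # Parenthesize the concatenation so the redirection target is unambiguous.
    out != "" { print > (out ".txt") }
' states.txt
```

Because blank lines contain no lowercase letter, they are filtered out too, so the per-state files here contain only the title and data lines.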
Yeah, in my enthusiasm to show the fix I had found, I forgot to take the chain grep out. Truth be told, the states.txt that I was working with was a fabricated example due to the sensitivity of the actual data that I was parsing through.
2

Though the question implies that awk is being used to parse the file, the script given relies more on other commands than on awk. Awk could do the whole job.

awk \
  ' \
    BEGIN \
    { FS = ":" }
    NF == 1 && /^[A-Z]/ \
    { FILE = $0 ".txt"; printf "\n%s\n\n", $0 >FILE }
    NF > 1 \
    { print >FILE }
  ' states.txt

Though a smaller script could do the job, this one has a little extra. Using a colon as the field delimiter quickly differentiates data lines from title lines. Blank lines are ignored, and printf is used to generate the title lines in the output files. This means blank lines aren't needed in the input file, and extra whitespace or blank lines don't mess up the output. That may or may not be what you want.
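On the variable-passing part of the question: inside an awk program, /state1/ matches the literal text "state1", and a bare pattern such as state1{f=1} only tests whether the variable is non-empty, which is true on every line. A minimal sketch of comparing each line against variables passed with -v follows; the names start and stop, the scratch directory, and the shortened sample are assumptions for illustration.

```shell
# Work in a scratch directory (assumption) so the example is self-contained.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Shortened sample in the same layout as states.txt from the question.
cat > states.txt <<'EOF'
Alabama

Area: 52,423 sq.mi (135,775 sq.km.), 30th

Alaska

Area: 656,425 sq.mi (1,700,134 sq.km.), 1st

Arizona

Area: 114,006 sq.mi (295,274 sq.km.), 6th
EOF

state1=Alaska
state2=Arizona

# Quote the shell variables and give each assignment its own -v.
awk -v start="$state1" -v stop="$state2" '
    $0 == stop  { f = 0 }   # stop before the next state name is printed
    $0 == start { f = 1 }   # start printing at the wanted state name
    f                       # print the line while the flag is set
' states.txt > "$state1.txt"
```

Here $0 == start is an exact string comparison; $0 ~ start would treat the value as a regular expression and match it anywhere in the line.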
