Need help parsing a file via UNIX commands

Question

I have a file with lines that look like this:

LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=ABCD,&FIELD7-0&FIELD8-0;

LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID3:FIELD1=XYZ,&FIELD7-0&FIELD8-0;

LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=PQRS,&FIELD5-1&FIELD6-0;
LINEID3:FIELD1=PQRS,&FIELD7-0&FIELD8-0;

I'm interested only in the lines that begin with LINEID1, and only in some elements of those lines (FIELD1, FIELD2, FIELD4 and FIELD9). The output should look like this (no & signs; they can be replaced with |):

FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0;
FIELD1=PQRS|FIELD4-0|FIELD9-0;

If additional information is required, do let me know, I'll post them in edits. Thanks!!

Looks like you want FIELD9, not FIELD5? And XYZ in data doesn't match WXYZ in output.
– Tom Zych, Commented Aug 30, 2014 at 7:57

Mark Setchell · Accepted Answer · 2014-08-30 08:44:29Z

4

This is not exactly what you asked for, but no-one else is answering and it is pretty close for you to get started with!

awk -F'[&:]' '/^LINEID1:/{print $2,$3,$5,$6}' OFS='|' file

Output

FIELD1=ABCD,|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ,|FIELD2-0|FIELD9-0|
FIELD1=PQRS,|FIELD3-1|FIELD9-0;|

The -F sets the Input Field Separator to colon or ampersand. Then it looks for lines starting LINEID1: and prints the fields you need. The OFS sets the Output Field Separator to the pipe symbol |.
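Because the command above prints by position, a row that lacks FIELD2 or FIELD4 shifts the columns (as the second and third output rows show). A minimal sketch of selecting by field name instead, assuming every wanted field starts with FIELD1, FIELD2, FIELD4 or FIELD9 followed by `-` or `=`; the function name `select_fields` is hypothetical and the sample rows are copied from the question:

```shell
#!/bin/sh
# Hypothetical helper: keep only FIELD1/2/4/9 from LINEID1 rows read on stdin.
select_fields() {
  awk -F'[&:]' '
    /^LINEID1:/ {
      out = ""
      for (i = 2; i <= NF; i++) {      # $1 is the LINEID, so start at $2
        f = $i
        sub(/,$/, "", f)               # drop the comma that trails FIELD1
        if (f ~ /^FIELD[1249][-=]/)    # keep only the wanted field names
          out = (out == "" ? f : out "|" f)
      }
      print out
    }'
}

result=$(select_fields <<'EOF'
LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;
LINEID2:FIELD1=ABCD,&FIELD5-1&FIELD6-0;
LINEID1:FIELD1=XYZ,&FIELD2-0&FIELD3-1&FIELD9-0
LINEID1:FIELD1=PQRS,&FIELD3-1&FIELD4-0&FIELD9-0;
EOF
)
printf '%s\n' "$result"
```

With the sample rows this should print the three lines the question asks for, keeping whatever trailing semicolon the input had.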

edited Aug 30, 2014 at 8:44

answered Aug 30, 2014 at 8:17

Mark Setchell


3 Comments

Joshua1729 Over a year ago

Thanks! a sed to replace the ','s and the file is good to go !

Mark Setchell Over a year ago

You can make the semicolons disappear by telling awk they are field separators, so change the -F'[&:]' to -F'[&:;]'

Joshua1729 Over a year ago

Oh! Thanks for that.. my knowledge in UNIX is negligible.. i used sed to get rid of them :D
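A sketch of the separator idea from the comments, taken one step further: the comma is also added to the bracket expression, and a `+` makes the adjacent `,&` count as one separator. Both additions are assumptions beyond what the comment suggests. The selection is still positional, so it only lines up for rows that contain every field, such as the first sample row used here:

```shell
#!/bin/sh
# First LINEID1 row from the question.
row='LINEID1:FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;'

# Treat any run of , & : ; as a single field separator, so no sed pass is needed.
cleaned=$(printf '%s\n' "$row" |
  awk -F'[,&:;]+' -v OFS='|' '/^LINEID1:/ {print $2, $3, $5, $6}')

printf '%s\n' "$cleaned"
```

For this row the comma and the trailing semicolon are both consumed as separators.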

Reuben L. · Accepted Answer · 2014-08-30 08:49:21Z

2

Pure awk:

awk -F ":" ' /LINEID1[^0-9]/{gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2); gsub(/,*&+/,"|",$2); print $2} ' file

Updated to give proper formatting and to omit LINEID11, etc...

Output:

FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;

Explanation:

awk -F ":" - split lines into LHS ($1) and RHS ($2) since output only requires RHS

/LINEID1[^0-9]/ - keep only lines that match LINEID1 followed by a non-digit, which skips LINEID11, LINEID100, etc.

gsub(/FIELD[^1249]+[-=][A-Z0-9]+/,"",$2) - remove all fields on the RHS that aren't 1, 2, 4 or 9

gsub(/,*&+/,"|",$2) - clean up the leftover delimiters on the RHS
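To make the two gsub steps concrete, here is a small sketch that prints the string after each one. It runs on the RHS of the first sample row only; the labels `after removal` and `after cleanup` are just for illustration:

```shell
#!/bin/sh
# RHS of the first LINEID1 row from the question.
rhs='FIELD1=ABCD,&FIELD2-0&FIELD3-1&FIELD4-0&FIELD9-0;'

steps=$(printf '%s\n' "$rhs" | awk '{
  gsub(/FIELD[^1249]+[-=][A-Z0-9]+/, "")   # step 1: delete unwanted fields
  print "after removal: " $0
  gsub(/,*&+/, "|")                        # step 2: collapse , and & runs into |
  print "after cleanup: " $0
}')

printf '%s\n' "$steps"
```

The first step leaves a doubled `&&` where FIELD3-1 used to be, which is why the cleanup pattern uses `&+` rather than a single `&`.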

edited Aug 30, 2014 at 8:49

answered Aug 30, 2014 at 8:31

Reuben L.


1 Comment

Mark Setchell Over a year ago

This will also find lines with LINEID10, LINEID11 and LINEID199.

aks · Accepted Answer · 2014-08-30 08:23:29Z

To select rows from data with Unix command lines, use grep, awk, perl, python, or ruby (in increasing order of power & possible complexity).

To select columns from data, use cut, awk, or one of the previously mentioned scripting languages.

First, let's get only the lines with LINEID1 (assuming the input is in a file called input).

grep '^LINEID1' input

will output all the lines beginning with LINEID1.
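One caveat, sketched below: `'^LINEID1'` is a prefix match, so it would also accept an ID such as LINEID10. The question's data has no such row, so the LINEID10 line here is hypothetical. Including the colon in the pattern is one way to pin the match to LINEID1 exactly:

```shell
#!/bin/sh
# Three made-up rows; LINEID10 is not in the question's data.
sample='LINEID1:FIELD1=A,&FIELD2-0;
LINEID10:FIELD1=B,&FIELD2-0;
LINEID2:FIELD1=C,&FIELD5-1;'

# grep -c counts matching lines.
loose=$(printf '%s\n' "$sample" | grep -c '^LINEID1')
strict=$(printf '%s\n' "$sample" | grep -c '^LINEID1:')

echo "loose=$loose strict=$strict"
```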

Next, extract the columns we care about:

grep '^LINEID1' input    |   # keep lines beginning with LINEID1
cut -d: -f2              |   # extract column 2 (after ':')
tr ',&' '\n\n'           |   # turn ',' and '&' into newlines
grep -E 'FIELD[1249]'    |   # keep only FIELD1, FIELD2, FIELD4, FIELD9
tr '\n' '|'              |   # turn newlines into '|'
sed -e $'s/|\\(FIELD1\\)/\\\n\\1/g' -e 's/|$//'   # '|' is left unescaped: GNU sed reads '\|' as alternation

The last line inserts newlines in front of the FIELD1 lines, and removes any trailing '|'.

That last sed pattern is a little more challenging because sed requires a newline in the replacement to be escaped with a backslash. Here the newline comes from a bash $'...' string, which in turn means the other backslashes in that string have to be doubled.
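As an alternative to the bash escape, POSIX sed accepts a backslash followed by an actual line break inside an ordinary single-quoted script. A minimal sketch on a shortened, made-up joined string (not the full pipeline output):

```shell
#!/bin/sh
# Two records already joined with '|', as the tr step would leave them.
joined='FIELD1=ABCD|FIELD2-0|FIELD1=XYZ|FIELD9-0'

# The replacement is: backslash, real newline, then the captured FIELD1.
split=$(printf '%s\n' "$joined" | sed 's/|\(FIELD1\)/\
\1/g')

printf '%s\n' "$split"
```

This avoids the doubled backslashes, at the cost of a script that spans two lines.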

Here's the output from the above command:

FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;

This command took only a couple of minutes to cobble together.

Even so, it's bordering on the complexity threshold where I would shift to perl or ruby because of their excellent string processing.

The same script in ruby might look like:

#!/usr/bin/env ruby
#
while line = gets do
  if line.chomp =~ /^LINEID1:(.*)$/
    f1, others = $1.split(',')
    fields = others.split('&').grep(/FIELD[1249]/)
    puts [f1, fields].flatten.join("|")
  end
end

Running this script on the same input file produces the same output as above:

$ ./parse-fields.rb < input
FIELD1=ABCD|FIELD2-0|FIELD4-0|FIELD9-0;
FIELD1=XYZ|FIELD2-0|FIELD9-0
FIELD1=PQRS|FIELD4-0|FIELD9-0;
