
I need to extract all K[A-Z]{3} values (e.g., KVSF) and US[CW][0-9]{8} values (e.g., USW00054740) from every line in a text file.

I am using the code below to try to achieve this, but I need to extract these values only when both are present on the same line (e.g., the last three lines in the data below).

Attempted Code:

#Extracts any values matching K[A-Z]{3} (in this PCRE, \K only drops the leading quote from the match)
grep -Po '"\K[A-Z]{4}\b' usc.matched > out.1

#Extracts any values matching US[CW][0-9]{8}
grep -Po '\bUS\w*' usc.matched > out.2

#Pastes two datasets together, separated by a comma
paste -d',' out.1 out.2 > stations.filtered

#Removes any lines that do not lead with "K"
sed -i '/^[^K]/d' stations.filtered

JSON Data:

{"sids": ["94737 1", "RUT 3", "KRUT 5"], "name": "RUTLAND STATE AP"},
{"sids": ["54740 1", "VSF 3", "KVSF 5", "USW00054740 6"], "name": "SPRINGFIELD HARTNESS AP"},
{"sids": ["94601 1", "RKD 3", "KRKD 5"], "name": "ROCKLAND KNOX CO RGNL AP"},
{"sids": ["20B 3"], "name": "ROCKLAND STN"},
{"sids": ["177250 2", "USC00177250 6"], "name": "ROCKLAND"},
{"sids": ["177255 2", "USC00177255 6", "RCKM1 7"], "name": "ROCKLAND"},
{"sids": ["177260 2"], "name": "ROCKLAND MOORING LBS"},
{"sids": [], "name": "ROCKLAND"},
{"sids": ["14612 1"], "name": "ROCKLAND"},
{"sids": ["274380 2", "USC00274380 6"], "name": "KEARSARGE"},
{"sids": ["192770 2", "USC00192770 6"], "name": "FISKDALE"},
{"sids": ["US1CTNL0005 6", "CTNL0005 10"], "name": "OAKDALE 2.6 WNW"},
{"sids": ["063989 2", "USC00063989 6"], "name": "LAKE KONOMOC"},
{"sids": ["14740 1", "14721 1", "063456 2", "069704 2", "BDL 3", "72508 4", "KBDL 5", "USW00014740 6", "BDL 7"], "name": "HARTFORD-BRADLEY INTL AP"},
{"sids": ["94702 1", "060806 2", "BDR 3", "72504 4", "KBDR 5", "USW00094702 6", "BDR 7"], "name": "IGOR I SIKORSKY MEMORI AP"},
{"sids": ["54734 1", "DXR 3", "KDXR 5", "USW00054734 6"], "name": "DANBURY MUNI AP"},

Current Output:

KRUT,
KVSF,USW00054740
KRKD

USC00177250
USC00177255


USC00274380
USC00192770
US1CTNL0005
USC00063989
KBDL,USW00014740
KBDR,USW00094702
KDXR,USW00054734

Expected Output:

KVSF,USW00054740
KBDL,USW00014740
KBDR,USW00094702
KDXR,USW00054734
  • Yes, I have jq installed. I have also appended Current Output and Expected Output to help further define what I am trying to do. (Commented Jul 11, 2017 at 10:08)

3 Answers


You can use:

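awk -F '[][" \t{},:]+' '{
   a = b = ""
   # scan every field, remembering the last K and US identifiers seen
   for (i = 2; i <= NF; i++)
      if ($i ~ /^K[A-Z]{3}$/)
         a = $i
      else if ($i ~ /^US[CW][0-9]+/)
         b = $i
   # print only when the line had both
   if (a != "" && b != "")
      print a, b
}' OFS=, file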

KVSF,USW00054740
KBDL,USW00014740
KBDR,USW00094702
KDXR,USW00054734
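
To see why the loop above starts at i=2, here is a small sketch that prints how that field separator splits one line. The sample line is a trimmed copy of the DANBURY entry from the question, shortened for illustration. Because the line begins with separator characters, field 1 comes out empty and the identifiers land in later fields.

```shell
# Trimmed sample line in the same shape as the question's data (illustrative only).
line='{"sids": ["54734 1", "KDXR 5", "USW00054734 6"], "name": "DANBURY MUNI AP"},'

# Print every field with its index, using the same field separator as the answer.
fields=$(printf '%s\n' "$line" |
    awk -F '[][" \t{},:]+' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }')

printf '%s\n' "$fields"
```

The K and US identifiers each end up as a whole field on their own, which is what lets the answer anchor its patterns with ^ and $.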

If perl is okay (this assumes the K string precedes the US string on the same line):

$ perl -lne 'print "$1,$2" if /"(K[A-Z]{3})\b.*"(US[CW]\d{8}\b)/' usc.matched 
KVSF,USW00054740
KBDL,USW00014740
KBDR,USW00094702
KDXR,USW00054734
  • if /"(K[A-Z]{3})\b.*"(US[CW]\d{8}\b)/ runs the print only when the line matches this pattern
    • "(K[A-Z]{3})\b matches K followed by three uppercase letters, only if preceded by " and followed by a word boundary
    • "(US[CW]\d{8}\b) matches US followed by C or W and eight digits, only if preceded by " and followed by a word boundary
  • print "$1,$2" prints the two captured groups
  • See http://perldoc.perl.org/perlrun.html#Command-Switches for details on the -lne options


In awk. Tune the regexen to your liking:

$ awk -v OFS=, '
/K[A-Z]{3} / && /US[C,W][0-9]{8}/ {
    b=""
    while(match($0,/K[A-Z]{3} |US[C,W][0-9]{8}/)) {
        b=b (b==""?"":OFS) substr( $0, RSTART, RLENGTH)
        $0=substr($0,RSTART+RLENGTH)
    } 
print b}' file
KVSF ,USW00054740
KBDL ,USW00014740
KBDR ,USW00094702
KDXR ,USW00054734
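
The trailing space in the output above comes from the space inside the K pattern. As a hedged sketch of one way to tune it, the variant below anchors both patterns on the opening double quote and the space that follows each identifier, then trims those two characters when extracting. It uses [CW] rather than [C,W], which would also accept a literal comma. As assumptions of this sketch, the repetition counts are spelled out or loosened to [0-9]+ so it does not depend on interval-expression support, and the sample file holds three lines copied from the question.

```shell
# Three lines copied from the question's data, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
{"sids": ["94737 1", "RUT 3", "KRUT 5"], "name": "RUTLAND STATE AP"},
{"sids": ["54740 1", "VSF 3", "KVSF 5", "USW00054740 6"], "name": "SPRINGFIELD HARTNESS AP"},
{"sids": ["274380 2", "USC00274380 6"], "name": "KEARSARGE"},
EOF

trimmed=$(awk -v OFS=, '
/"K[A-Z][A-Z][A-Z] / && /"US[CW][0-9]+ / {
    line = $0; out = ""
    while (match(line, /"(K[A-Z][A-Z][A-Z]|US[CW][0-9]+) /)) {
        # skip the leading quote and drop the trailing space
        id = substr(line, RSTART + 1, RLENGTH - 2)
        out = out (out == "" ? "" : OFS) id
        line = substr(line, RSTART + RLENGTH)
    }
    print out
}' "$tmp")

printf '%s\n' "$trimmed"
rm -f "$tmp"
```

Anchoring on the quote also keeps four-letter words inside station names (such as KNOX in "ROCKLAND KNOX CO RGNL AP") from being picked up as identifiers.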
