How to extract patterns from a file and fill a bash array with them?

Question

My intent is to write a shell script that extracts a pattern from a file using regular expressions and fills an array with all the occurrences of the pattern, so that I can loop over them.

What is the best way to achieve this?

I am trying to do it with sed. One problem I am facing is that the matches can contain newlines, and these newlines must be preserved, e.g.:

File content:

"My name 
is XXX"
"My name is YYY"
"Today
is
the "

When I extract all patterns between double quotes, including the double quotes, the output for the first occurrence must be:

"My name
is XXX"

What about removing all the newlines, and then inserting the newlines only between "" ? Also note that writing a parser that handles escape sequences using only regular expressions is impossible.
– KamilCuk, Commented Jun 12, 2019 at 12:21
With this regular expression the newline isn't considered, as I mentioned in the post. Besides, I would like to know how to fill a shell script array with the output of a command like this.
– Rodolfo, Commented Jun 12, 2019 at 12:30
"the newline isn't considered": what do you mean by this? How is it not considered? grep -z should work.
– KamilCuk, Commented Jun 12, 2019 at 12:31
It appears that this command doesn't work for a multi-line pattern. The intent is to match everything between double quotes. The suggested command only matches when both double quotes are on the same line.
– Rodolfo, Commented Jun 12, 2019 at 12:36

KamilCuk · Accepted Answer · 2019-06-12 13:03:31Z

fill an array with all the occurrences of the pattern

First, convert your file so that it uses a meaningful delimiter, e.g. the null byte. With GNU sed this can be done using the -z switch:

sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g'

I've added the [^"]* at the end so that characters that are not between double quotes are removed.

After that, the data becomes much simpler to parse.
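As a rough illustration of that parsing step, the NUL-delimited records can be consumed one at a time with a read loop. This is only a sketch: it assumes GNU sed (for -z and \x00), and the sample.txt file name and the count/last variables are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: walk the NUL-delimited records produced by the sed command above.
# Assumes GNU sed; "sample.txt" is a hypothetical file created just for this demo.
tmp=$(mktemp -d)
printf '%s\n' '"My name' 'is XXX"' '"My name is YYY"' > "$tmp/sample.txt"

count=0
last=
# IFS= and -r keep each record byte-for-byte; -d '' makes read stop at NUL.
while IFS= read -r -d '' item; do
    count=$((count + 1))
    last=$item
    printf 'item %d: [%s]\n' "$count" "$item"
done < <(sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g' "$tmp/sample.txt")

rm -r "$tmp"
```

Because the loop reads from a process substitution rather than a pipe, it runs in the current shell and the variables set inside it are still available afterwards.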

You can get the first element with:

head -z -n1

Or sort and count the occurrences:

sort -z | uniq -z -c

Or load it into an array with bash's mapfile (the -d option requires bash 4.4 or newer):

mapfile -d '' -t arr < <(<input sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g')

Alternatively, you can use e.g. $'\01' as the separator. As long as that byte does not occur in the data, such a stream is simple to parse in bash.
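A minimal sketch of the $'\01' variant, assuming GNU sed, bash 4.4+ (for mapfile -d), and input that never contains the 0x01 byte. The input, joined and arr names are only illustrative.

```shell
#!/usr/bin/env bash
# Sketch: use byte 0x01 as the separator so the whole stream fits in a variable.
# Assumes GNU sed, bash 4.4+, and data that contains no 0x01 byte.
input=$'"My name\nis XXX"\n"My name is YYY"\n'

# Unlike NUL, the 0x01 byte survives command substitution.
joined=$(printf '%s' "$input" | sed -z 's/"\([^"]*\)"[^"]*/\1\x01/g')

# Split on 0x01; -t drops the delimiter from each element.
# printf is used instead of a here-string, which would append a newline.
mapfile -d $'\01' -t arr < <(printf '%s' "$joined")
declare -p arr
```

The benefit over NUL is that the intermediate result can be stored and passed around as an ordinary string before it is split.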

Handling such streams is a bit hard in bash. A shell variable cannot hold an embedded null byte, and command substitution drops null bytes, sometimes with a warning. When handling data with arbitrary bytes, I usually convert it to plain ASCII with xxd -p and back with xxd -r -p. With that, it becomes easier.
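The xxd round trip might look roughly like this. It is a sketch that assumes xxd is installed (it commonly ships with vim) and bash 4.4+ for mapfile -d; the hex and parts names are made up for the example.

```shell
#!/usr/bin/env bash
# Sketch: carry NUL-containing data through a shell variable as hex text.
# Assumes xxd is available and bash 4.4+ (for mapfile -d).
hex=$(printf 'one\0two\0' | xxd -p)
printf 'hex form: %s\n' "$hex"

# Decode back to the original bytes only at the point of use.
mapfile -d '' -t parts < <(printf '%s' "$hex" | xxd -r -p)
declare -p parts
```

The hex string contains only plain ASCII characters, so it can be assigned, compared and passed as an argument without losing the NUL bytes it encodes.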

The following script:

cat <<'EOF' >input
"My name
is XXX"
"My name is YYY"
"Today
is
the "
EOF

sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g' input > input_parsed

echo "##First element is:"
printf '"'
<input_parsed head -z -n1 
printf '"\n'

echo "##Element counts are:"
<input_parsed sort -z | uniq -z -c

echo
echo "##The array is:"
mapfile -d '' -t arr <input_parsed
declare -p arr

will output (the formatting is a bit off because the output from uniq is not newline-delimited):

##First element is:
"My name
is XXX"
##Element counts are:
      1 My name
is XXX      1 My name is YYY      1 Today
is
the 
##The array is:
declare -a arr=([0]=$'My name\nis XXX' [1]="My name is YYY" [2]=$'Today\nis\nthe ')

Tested on repl.it.
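The run-together counts in the output above come from uniq -z ending each record with a NUL byte instead of a newline. One possible way to make them readable is to convert NUL to newline as the very last step. This is a sketch assuming GNU sort and uniq; the sample records and the counts variable are invented for the demo.

```shell
#!/usr/bin/env bash
# Sketch: keep records NUL-delimited while sorting and counting, then
# translate NUL to newline only for display. Assumes GNU sort and uniq (-z).
counts=$(printf '%s\0' $'My name\nis XXX' 'My name is YYY' 'My name is YYY' |
    sort -z | uniq -z -c | tr '\0' '\n')
printf '%s\n' "$counts"
```

After the tr step a multi-line element can no longer be told apart from two separate records, so this is only suitable for human-readable output, not for further parsing.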

Ed Morton · Accepted Answer · 2019-06-12 13:35:55Z

This may be what you're looking for, depending on the answers to the questions I posted in a comment:

$ readarray -d '' -t arr < <(grep -zo '"[^"]*"' file)

$ printf '%s\n' "${arr[0]}"
"My name
is XXX"

$ declare -p arr
declare -a arr=([0]=$'"My name \nis XXX"' [1]="\"My name is YYY\"" [2]=$'"Today\nis\nthe "')

It uses GNU grep for -z.
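To cover the "loop over them" part of the question, the array can then be iterated like any other bash array. The following is a sketch assuming GNU grep and bash 4.4+; the temporary file stands in for "file", and stripping the surrounding quotes is an optional extra that the question did not ask for.

```shell
#!/usr/bin/env bash
# Sketch: fill the array with grep -zo, then loop over it.
# Assumes GNU grep and bash 4.4+; the temp file is a stand-in for "file".
file=$(mktemp)
printf '%s\n' '"My name' 'is XXX"' '"My name is YYY"' '"Today' 'is' 'the "' > "$file"

readarray -d '' -t arr < <(grep -zo '"[^"]*"' "$file")

for i in "${!arr[@]}"; do
    item=${arr[i]}
    # Optional: strip the leading and trailing double quote.
    item=${item#\"}
    item=${item%\"}
    printf '%d: %s\n' "$i" "$item"
done

rm -f "$file"
```

Quoting "${arr[@]}" or "${!arr[@]}" in the for loop matters here: without the quotes, elements containing newlines or spaces would be split into several words.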

edited Jun 12, 2019 at 13:35

answered Jun 12, 2019 at 13:29


tshiono · Accepted Answer · 2019-06-13 00:28:07Z

sed can extract your desired pattern with or without newlines. But if you want to store multiple results in a bash array, it may be easier to use bash's regex matching.
Try the following:

lines=$(< "file")                   # slurp all lines
re='"[^"]+"'                        # regex to match substring between double quotes
while [[ $lines =~ ($re)(.*) ]]; do
    array+=("${BASH_REMATCH[1]}")   # push the matched pattern to the array
    lines=${BASH_REMATCH[2]}        # update $lines with the remaining part
done

# report the result
for (( i=0; i<${#array[@]}; i++ )); do
    echo "$i: ${array[$i]}"
done

Output:

0: "My name
is XXX"
1: "My name is YYY"
2: "Today
is
the "

edited Jun 13, 2019 at 0:28

answered Jun 13, 2019 at 0:07
