My intent is to write a shell script that uses a regular expression to extract a pattern from a file and fill an array with all the occurrences of the pattern, so that I can loop over them.

What is the best way to achieve this?

I am trying to do it with sed. One problem I am facing is that a match can span several lines, and those newlines must be preserved, e.g.:

File content:

"My name 
is XXX"
"My name is YYY"
"Today
is
the "

When I extract everything between double quotes, including the double quotes themselves, the first occurrence must come out as:

"My name
is XXX"
  • What about removing all the newlines, and then inserting newlines only between "" ? Also note that writing a parser that handles escape sequences with regular expressions alone is impossible. Commented Jun 12, 2019 at 12:21
  • grep -Eoz '"[^"]*"' file Commented Jun 12, 2019 at 12:22
  • With this regular expression the newline isn't considered, as I mentioned in the post. Besides, I would like to know how to fill a shell script array with the output of a command like this. Commented Jun 12, 2019 at 12:30
  • the newline isn't considered - what do you mean by this? How is it not considered? grep -z should work. Commented Jun 12, 2019 at 12:31
  • It appears that this command doesn't work for a multi-line pattern. The intent is to match everything between double quotes. The suggested command only matches something if both double quotes are on the same line. Commented Jun 12, 2019 at 12:36

3 Answers

fill an array with all the occurrences of the pattern

First convert your file so that it uses a meaningful delimiter, e.g. the null byte, for instance with GNU sed and its -z switch:

sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g'

I've added the [^"]* at the end, so that characters that are not between " are removed.

After that, the data becomes much easier to parse.

You can get the first element with:

head -z -n1

Or sort and count the occurrences:

sort -z | uniq -z -c

Or load it into an array with bash's mapfile:

mapfile -d '' -t arr < <(sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g' input)

Alternatively, you can use e.g. $'\01' as the separator. As long as that byte never occurs in the data, the result is simple to parse in bash.
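
A minimal sketch of the $'\01' variant, assuming bash, GNU sed (for -z and for \x01 in the replacement) and data that never contains the byte 0x01. The file name input_demo is made up for the illustration:

```shell
# Sample data; the file name input_demo is hypothetical.
printf '%s\n' '"My name' 'is XXX"' '"My name is YYY"' > input_demo

# Replace each quoted string, plus the unquoted text after it,
# by its content followed by the byte 0x01, then split on that byte.
arr=()
while IFS= read -r -d $'\01' item; do
    arr+=("$item")
done < <(sed -z 's/"\([^"]*\)"[^"]*/\1\x01/g' input_demo)

declare -p arr
```

Because 0x01 is an ordinary byte for the shell, each item can also be kept in a plain variable, which is not possible with a null byte.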

Handling such streams is a bit hard in bash: a shell variable cannot hold an embedded null byte, and command substitutions sometimes print warnings about it. When handling data with arbitrary bytes, I usually convert it to plain ASCII with xxd -p and back with xxd -r -p. With that, it becomes easier.
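
A small sketch of the xxd round trip described above, assuming bash and that xxd is installed (the check with command -v skips that part otherwise). The variable names are only for illustration:

```shell
# A command substitution drops NUL bytes; bash 4.4+ also warns on stderr
# (silenced here by the surrounding redirection).
{ v=$(printf 'a\0b'); } 2>/dev/null
echo "length without encoding: ${#v}"

# Hex-encode the bytes so they fit in a variable, then decode when needed.
if command -v xxd >/dev/null 2>&1; then
    hex=$(printf 'a\0b' | xxd -p)
    echo "hex form: $hex"
    bytes=$(printf '%s' "$hex" | xxd -r -p | wc -c)
    echo "bytes after decoding: $((bytes))"
fi
```

For longer input, xxd -p wraps its output over several lines; xxd -r -p accepts that wrapped form when decoding.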

The following script:

cat <<'EOF' >input
"My name
is XXX"
"My name is YYY"
"Today
is
the "
EOF

sed -z 's/"\([^"]*\)"[^"]*/\1\x00/g' input > input_parsed

echo "##First element is:"
printf '"'
<input_parsed head -z -n1 
printf '"\n'

echo "##Element counts are:"
<input_parsed sort -z | uniq -z -c

echo
echo "##The array is:"
mapfile -d '' -t arr <input_parsed
declare -p arr

will output (the formatting is a bit off, because the output from uniq is not newline-delimited):

##First element is:
"My name
is XXX"
##Element counts are:
      1 My name
is XXX      1 My name is YYY      1 Today
is
the 
##The array is:
declare -a arr=([0]=$'My name\nis XXX' [1]="My name is YYY" [2]=$'Today\nis\nthe ')

Tested on repl.it.
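
If the run-together output of uniq -z is a problem, one option is to translate the NUL terminators to newlines purely for display. A sketch assuming bash and GNU sort and uniq (for -z), with made-up sample records:

```shell
# Three NUL-terminated records, one of them duplicated (made-up sample data).
counts=$(printf 'b\0a\0b\0' | sort -z | uniq -z -c | tr '\0' '\n')
printf '%s\n' "$counts"
```

Records that themselves contain newlines will still span several lines, so this is only suitable for reading the result, not for parsing it further.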

This may be what you're looking for, depending on the answers to the questions I posted in a comment:

$ readarray -d '' -t arr < <(grep -zo '"[^"]*"' file)

$ printf '%s\n' "${arr[0]}"
"My name
is XXX"

$ declare -p arr
declare -a arr=([0]=$'"My name \nis XXX"' [1]="\"My name is YYY\"" [2]=$'"Today\nis\nthe "')

It uses GNU grep for -z, and readarray -d requires bash 4.4 or newer.
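
To loop over the matches, which is what the question ultimately asks for, something along these lines should work. It is a sketch assuming GNU grep and bash 4.4 or newer; the file name sample.txt is made up for the demo:

```shell
# Recreate the sample from the question in a hypothetical file.
printf '%s\n' '"My name' 'is XXX"' '"My name is YYY"' '"Today' 'is' 'the "' > sample.txt

# Each match is printed NUL-terminated by grep -zo and becomes one array element.
readarray -d '' -t arr < <(grep -zo '"[^"]*"' sample.txt)

for item in "${arr[@]}"; do
    inner=${item#\"}      # strip the leading double quote
    inner=${inner%\"}     # strip the trailing double quote
    printf 'match: %s\n' "$inner"
done
```

Quoting "${arr[@]}" keeps every element intact, including the embedded newlines.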

sed can extract your desired pattern with or without newlines, but if you want to store multiple results in a bash array, it may be easier to use bash's built-in regex matching.
Please try the following:

lines=$(< "file")                   # slurp all lines
re='"[^"]+"'                        # regex to match substring between double quotes
while [[ $lines =~ ($re)(.*) ]]; do
    array+=("${BASH_REMATCH[1]}")   # push the matched pattern to the array
    lines=${BASH_REMATCH[2]}        # update $lines with the remaining part
done

# report the result
for (( i=0; i<${#array[@]}; i++ )); do
    echo "$i: ${array[$i]}"
done

Output:

0: "My name
is XXX"
1: "My name is YYY"
2: "Today
is
the "
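
The same loop can be wrapped in a function so that it is reusable. This is only a sketch: extract_quoted is a hypothetical helper name, and the nameref (local -n) assumes bash 4.3 or newer:

```shell
# extract_quoted TEXT ARRAY_NAME
# Stores every "..." substring of TEXT in the array called ARRAY_NAME.
extract_quoted() {
    local text=$1
    local -n _out=$2              # nameref to the caller's array
    local re='"[^"]+"'            # substring between double quotes
    _out=()
    while [[ $text =~ ($re)(.*) ]]; do
        _out+=("${BASH_REMATCH[1]}")   # keep the match
        text=${BASH_REMATCH[2]}        # continue with the remainder
    done
}

extract_quoted $'"My name\nis XXX"\n"My name is YYY"' found
declare -p found
```

Each iteration re-scans the remaining text, so for very large files the sed or grep approaches are likely to be faster.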
