4

I have a bash script that takes a simple properties file and substitutes the values into another file. (Property file is just lines of 'foo=bar' type properties)

INPUT=`cat $INPUT_FILE`
while read line; do
   PROP_NAME=`echo $line | cut -f1 -d'='`
   PROP_VALUE=`echo $line | cut -f2- -d'=' | sed 's/\$/\\\$/g'`
   time INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")"
done <<<$(cat "$PROPERTIES_FILE")
# Do more stuff with INPUT

However, when my machine is under high load (upper forties) I lose a lot of time on each sed call:

real  0m0.169s
user  0m0.001s
sys  0m0.006s

Low load:

real  0m0.011s
user  0m0.002s
sys  0m0.004s

Normally losing 0.1 seconds isn't a huge deal but both the properties file and the input files are hundreds/thousands of lines long and those .1 seconds add up to over an hour of wasted time.

What can I do to fix this? Do I just need more CPUs?

Sample properties (lines start with special char to create a way to indicate that something in the input is trying to access a property)

$foo=bar
$hello=world
^hello=goodbye

Sample input

This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Expected result

This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"
29
  • 9
    if you insist on doing this in bash then consider eliminating the 6 subshells that are invoked on each pass through the loop; for starters use the read to split the data on the = delimiter (eg, while IFS='=' read -r name value) and you'll get much better performance; for even better performance, regardless of system load, consider using a different tool (eg, awk, perl, python, etc), especially since these tools can perform the updates with a single pass through the source (as opposed to the repeated passes with the current code) Commented Apr 1 at 23:48
  • 2
    yes, repeated subshell invocations to call sed to repeatedly scan and update a (lengthy) variable are going to be silly slow; for large volumes of replacements you really need to consider a different tool that only scans/updates the source once (eg, awk, perl, python, etc). Commented Apr 1 at 23:56
  • 1
    @markp-fuso If you look closely at my sed I have a \b . All properties calls must be terminated with a word boundary such as a punctuation mark or whitespace or will be ignored Commented Apr 2 at 0:22
  • 2
    What should the output from your sample input be if your properties file contained $foo=$hello instead of $foo=bar? Commented Apr 2 at 11:56
  • 2
    I think you'll need to be much more specific about the desired behaviour to get reliable answers. As presented so far, there are many edge cases and complexities that can cause unexpected/undesired output. Commented Apr 2 at 23:50

7 Answers

5

One idea/approach using bash and sed; you could try something like:

#!/usr/bin/env bash

while IFS='=' read -r prop_name prop_value; do
  if [[ "$prop_name" == "^"* ]]; then
     prop_name="\\${prop_name}"
  fi
  input_value+=("s/${prop_name}\\b/${prop_value}/g")
done < properties.txt

sed_input="$(IFS=';'; printf '%s' "${input_value[*]}")"

sed "$sed_input" sample_input.txt

One way to check the value of sed_input is

declare -p sed_input

Or

printf '%s\n' "$sed_input"
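
The loop above escapes only a leading ^, so a name containing other regex metacharacters (., *, [, /) or a value containing / or & could still misbehave. What follows is a minimal sketch of one way to escape both sides without spawning a process per property. It assumes bash and GNU sed (for \b); the helper name escape_chars and the sample data are made up for illustration and are not part of the answer.

```shell
#!/usr/bin/env bash
# Hypothetical helper: backslash-escape every character of $1 that also
# appears in $2. The result is returned in REPLY to avoid a subshell.
escape_chars() {
  local s=$1 specials=$2 out='' c i
  for ((i = 0; i < ${#s}; i++)); do
    c=${s:i:1}
    [[ $specials == *"$c"* ]] && out+='\'
    out+=$c
  done
  REPLY=$out
}

# Made-up sample data for the sketch.
props=$(mktemp)
input=$(mktemp)
printf '%s\n' '$foo=bar' '$hello=world' '^hello=good/bye & co' 'a.b=dot' > "$props"
printf '%s\n' 'A $foo, $hello, ^hello; axb a.b' > "$input"

cmds=()
while IFS='=' read -r name value; do
  escape_chars "$name" '\/.*[]^$'; name_re=$REPLY   # BRE specials plus the / delimiter
  escape_chars "$value" '\/&';     value_re=$REPLY  # characters special in a replacement
  cmds+=("s/${name_re}\\b/${value_re}/g")
done < "$props"

# Join with newlines so a ';' inside a value cannot end a command early.
sed_script=$(IFS=$'\n'; printf '%s' "${cmds[*]}")
result=$(sed "$sed_script" "$input")
printf '%s\n' "$result"   # A bar, world, good/bye & co; axb dot

rm -f "$props" "$input"
```

Values that contain newlines are not handled by this sketch.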


10 Comments

this doesn't appear to abide by OP's word boundary requirement; add Leave first 2 'matches' alone: $foobar $hellow ^hello to the input file; with word boundary matching only ^hello (end of line) should be replaced
Right, I made a solution from the given input and gave a desired output, thank you for the comment.
Hi. "${prop_name/#/\\}" so just "\\$prop_name"? Isn't it the same? It looks way more complicated than what it means. The other one too.
It turns out the issue is with spawning subshells during high load, not actual inefficiency with the text processing, and this code is basically the same as the original minus the subshells, so I'll mark this as my accepted answer (also it's the one I ended up using)
Also bash is not written for text processing but here we are! What I'm saying is use the right tool for the job like awk and any other language that is written specifically for that kind of task you have now.
5

Adding additional lines to OP's input file to demonstrate word boundary matching and a property name occurring more than once in a line:

$ cat input.txt
This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

Leave first 2 matches alone: $foobar $hellow ^hello
^hello $foo $hello ^hello $foo $hello

Assumptions:

  • for word boundary matching it is sufficient to verify that the character immediately after a matching property name is not an alphanumeric character ([[:alnum:]]); otherwise we can expand the next_char testing (see awk code, below)
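
For comparison, here is a small sketch of what \b in OP's sed accepts as a boundary. It assumes GNU sed, where word characters are letters, digits and underscore; the sample string is made up. Note that the alphanumeric test in the awk code below is slightly looser: it would replace $foo_x, while \b would not.

```shell
#!/usr/bin/env bash
# Assumes GNU sed: \b matches between a word character (letter, digit or
# underscore) and a non-word character, or at the start/end of the line.
sample='$foo $foobar $foo_x $foo1 $foo.'
boundary_demo=$(printf '%s\n' "$sample" | sed 's/\$foo\b/bar/g')
printf '%s\n' "$boundary_demo"   # bar $foobar $foo_x $foo1 bar.
```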

General idea:

  • read all properties.txt entries into an array (map[name]=value)
  • for each line from input.txt, loop through all names, checking for any word boundary matches to replace

One idea using awk:

$ cat replace.awk

FNR==NR { split($0,arr,"=")                             # 1st file: split on "=" delimiter
          map[arr[1]]=arr[2]                            # build map[name]=value array, eg: map[$foo]=bar
          len[arr[1]]=length(arr[1])                    # save length of "name" so we do not have to repeatedly calculate later
          next
        }

NF      { newline = $0                                  # 2nd file: if we have at least one non white space field then make copy of current input line

          for (name in map) {                           # loop through all "names" to search for 
              line    = newline                         # start over copy of current line
              newline = ""

              while ( pos = index(line,name) ) {        # while we have a match ...

                    # find next_character after "name"; if it is an
                    # alpha/numeric character we do not have a word
                    # boundary otherwise we do have a word boundary
                    # and we need to make the replacement with 
                    # map[name]=value
                    
                    next_char = substr(line,pos+len[name],1)

                    if (next_char ~ /[[:alnum:]]/)
                       newline = newline substr(line,1,pos+len[name]-1)
                    else
                       newline = newline substr(line,1,pos-1) map[name]

                    line = substr(line,pos+len[name])   # strip off rest of line to test for additional matches of "name"
              }
              newline = newline line                    # append remaining contents of line
          }
          $0 = newline                                  # overwrite current input line with "newline"
        }
1                                                       # print current line

NOTES:

  • most awk string matching functions (eg, sub(), gsub(), match()) treat the search pattern as a regex
  • this means those non-alphabetic characters in OP's properties file (eg, $, ^) will need to be escaped before trying to use sub() / gsub() / match()
  • instead of jumping through hoops to escape all special characters I've opted to use ...
  • the index() function treats search patterns as literal text (so no need to escape special characters)
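
A small sketch of the difference described in these notes, using a made-up input line: when the name ^hello is handed to gsub() unescaped, the ^ acts as a start-of-line anchor, so the wrong text is replaced. index() finds the literal text.

```shell
#!/usr/bin/env bash
line='hello ^hello'

# gsub() compiles the string "^hello" as a regex: ^ anchors to the start of
# the line, so the leading "hello" is replaced and "^hello" is left alone.
regex_result=$(printf '%s\n' "$line" | awk '{ gsub("^hello", "goodbye"); print }')

# index() does a plain substring search and reports where "^hello" starts.
literal_pos=$(printf '%s\n' "$line" | awk '{ print index($0, "^hello") }')

printf '%s\n' "$regex_result"   # goodbye ^hello
printf '%s\n' "$literal_pos"    # 7
```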

Taking for a test drive:

$ awk -f replace.awk properties.txt input.txt
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world

For timing purposes I created a couple larger files from OP's properties file and my input.txt file (see above):

$ awk 'BEGIN {FS=OFS="="} {map[$1]=$2} END {for (i=1;i<=300;i++) {for (name in map) {nn=name x;print nn,map[name]};x++}}' properties.txt > properties.900.txt

$ for ((i=1;i<=250;i++)); do cat input.txt; done > input.1500.txt

$ wc -l properties.900.txt input.1500.txt
  900 properties.900.txt
 1500 input.1500.txt

Timing for the larger data files:

$ time awk -f replace.awk properties.900.txt input.1500.txt > output

real    0m0.126s
user    0m0.122s
sys     0m0.004s

$ head -12 output
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

Leave first 2 matches alone: $foobar $hellow goodbye
goodbye bar world goodbye bar world

NOTE: timing is from an Ubuntu 22.04 system (metal, vm) running on an Intel i7-1260P

4

What can I do to fix this?

Refactor your idea to write it in a single performant programming language. Bash is a shell - it executes other programs. Each program takes time to start.

You could generate a sed script in one go and then execute it. Note that this will not handle ^hello or any other . * [ ? \ characters correctly, as sed works with regexes: ^ matches the beginning of a line.

sed "$(sed 's/\([^=]*\)=\(.*\)/s`\1\\b`\2`g/g' "$PROPERTIES_FILE")" "$INPUT_FILE"

You could escape the special characters with something along these lines. See also https://stackoverflow.com/a/2705678/9072753 .

sed "$(sed 's/[]\/$*.^&[]/\\&/g; s/\([^=]*\)=\(.*\)/s`\1\\b`\2`g/g; ' "$PROPERTIES_FILE")" "$INPUT_FILE"

Notes: use shellcheck. Use $(...) instead of backticks. Do not abuse cats - just use <file instead of <<<$(cat "$PROPERTIES_FILE"). Don't SCREAM - consider lowercase variables. Consider m4, envsubst or jinja2 or just cpp for templating.
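
As an illustration of the redirection note, here is a minimal sketch of reading name=value lines with read and a plain < redirect, with no cat and no command substitution. The file contents and variable names are made up. Because the loop is not part of a pipeline, it runs in the current shell and the variables set inside it are still available afterwards.

```shell
#!/usr/bin/env bash
props_file=$(mktemp)
printf '%s\n' '$foo=bar' '$url=http://example.com/?a=b' > "$props_file"

count=0
last_name=''
last_value=''
# IFS='=' splits on the first '='; the rest of the line, including any
# further '=' characters, ends up in the last variable (value).
while IFS='=' read -r name value; do
  count=$((count + 1))
  last_name=$name
  last_value=$value
done < "$props_file"

printf '%s properties read; last: %s -> %s\n' "$count" "$last_name" "$last_value"
rm -f "$props_file"
```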

4

Your code seems to run in O(m.n) time looking for m possible properties in input of size n.

Since "both the properties file and the input files are hundreds/thousands of lines long", improving this to O(n) time may provide a noticeable speedup:

perl -e '
    # load mapping data into hash
    while ( ($k,$v) = split "=",<<>>,2 ) {
        chomp $v;
        $k2v{$k} = $v;
        last if eof;
    }

    # build regex from all keys (\Q escapes regex metacharacters)
    $re = join "|", map qr/\Q$_\E/, keys %k2v;

    # load input file as single string
    undef $/;
    $_ = <<>>;

    # convert all properties simultaneously
    s/($re)\b/ $k2v{$1} /ge;

    # output the result
    print;

' propfile inputfile

This makes use of a Perl regex optimisation that allows checking literal string alternations in constant, instead of linear, time.

I assume that recursive rewriting is not desired. For example, applying:

$key1=$key2
$key2=value

to blah $key1 should result in blah $key2 and not blah value
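
For contrast, here is a small sketch (assuming GNU sed for \b) of what a sequential list of sed substitutions does with the same two properties. Each command sees the output of the previous one, so the result depends on the order of the properties file, which the single simultaneous substitution above avoids.

```shell
#!/usr/bin/env bash
text='blah $key1'

# Properties in file order: $key1 is rewritten to $key2, then $key2 to value.
chained=$(printf '%s\n' "$text" | sed 's/\$key1\b/$key2/g; s/\$key2\b/value/g')

# Same two substitutions in the opposite order: no chaining happens.
reversed=$(printf '%s\n' "$text" | sed 's/\$key2\b/value/g; s/\$key1\b/$key2/g')

printf '%s\n' "$chained"    # blah value
printf '%s\n' "$reversed"   # blah $key2
```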


It may also be possible to process multiple input files in a loop so that the mapping data only needs to be loaded once, but it will be necessary to add some additional code to save each output instead of just writing to stdout.

7 Comments

This is lightning fast! :-)
or perhaps it is O(m+n) rather than O(n), since building the regex is linear in the alternations. should still be much better than O(m.n)
@markp-fuso how long does this code take with your test files?
@markp-fuso yay :-)
a recursive (not quite right word) solution might be to wrap the s/// in a while loop. infinite looping could be detected by incrementing a counter on each iteration and testing it hasn't exceeded some maximum value. eg. while ( ++$ct<=$max && s/.../ge ) {} warn "max loop count exceeded\n" if $ct>$max;. Another option could be to preprocess property files so that multiple rounds of replacements are not required in the first place.
4

I agree that this would be a lot more efficient in awk or perl or python, etc...

But to answer the question asked, yes, you can make this a lot more efficient with the tools you have. As mentioned, get rid of the time wasters. Your original code spawns unnecessary processes on practically every line.

Just have the code make one pass through the file to write all the individual sed substitution commands out to another script file (or accumulate them into a string as Jetchisel suggests) and then run that.

$ cat props
$foo=bar
$hello=world
^hello=goodbye

$ cat editme
This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

$ cat editme.new
cat: editme.new: No such file or directory

$ cat script
#!/bin/bash
date +'Inital timestamp: %D %T %N' >&2
{ printf '%s\n' '#!/bin/bash' "time sed '"
  date +'Starting read of props file: %D %T %N' >&2
  while IFS='=' read -r k v;
  do printf '  s`%q\\b`%q`g;\n' "$k" "$v"
  done < props
  date +'Closing sed command: %D %T %N' >&2
  printf '%s\n' "' editme > editme.new"
} > editor
date +'Done writing sed script file: %D %T %N' >&2
cat editor
. editor

The two time outputs at the bottom are for the one run of sed and the whole script, respectively.

$ time ./script
Inital timestamp: 04/01/25 20:21:21 799559000
Starting read of props file: 04/01/25 20:21:21 812337100
Closing sed command: 04/01/25 20:21:21 824636300
Done writing sed script file: 04/01/25 20:21:21 837483000
#!/bin/bash
time sed '
  s`\$foo\b`bar`g;
  s`\$hello\b`world`g;
  s`\^hello\b`goodbye`g;
' editme > editme.new

real    0m0.014s
user    0m0.000s
sys     0m0.016s

real    0m0.104s
user    0m0.075s
sys     0m0.046s

and afterwards -

$ cat editme.new
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

addendum

Most bash scripts benefit a lot from moving subshells to built-ins.
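
As a small illustration of that point, the echo | cut pipelines from the question can be replaced with parameter expansion, which runs inside the shell and starts no extra processes. The sample line is made up.

```shell
#!/usr/bin/env bash
line='$url=http://example.com/?a=b'

# External version, as in the question: each assignment forks a command
# substitution and runs a pipeline.
name_ext=$(echo "$line" | cut -f1 -d'=')
value_ext=$(echo "$line" | cut -f2- -d'=')

# Builtin version: no processes are started.
name=${line%%=*}    # drop the longest suffix starting at the first '='
value=${line#*=}    # drop the shortest prefix ending at the first '='

printf '%s | %s\n' "$name" "$value"   # $url | http://example.com/?a=b
```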

A simplified version of my sed-based script above:

$ cat script
#!/bin/bash
{ printf '%s\n' '#!/bin/bash' "sed '"
  while IFS='=' read -r k v; do printf '  s`%q\\b`%q`g;\n' "$k" "$v"; done < props
  printf '%s\n' "' editme > editme.new"
} > editor
. editor

$ time ./script

real    0m0.043s
user    0m0.000s
sys     0m0.031s

Using simple bash string processing for the whole thing -

$ cat v2
#!/bin/bash
text="$(<editme)"
while IFS='=' read -r k v;
do while [[ "$text" =~ "$k" ]]; do text="${text//$k/$v}"; done
done < props
echo "$text" > editme.2

$ time ./v2

real    0m0.011s
user    0m0.015s
sys     0m0.000s

$ diff editme.new editme.2

This performs horribly on a large file, though, for a lot of reasons. I made a file of nearly 400MB and the sed script handled it in about 12.5s. I broke the all-in-memory all-bash version just under 3m.

2

This will produce the output you show from the input you show, using any awk:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    NR == FNR {
        pos = index($0, "=")
        tag = substr($0, 1, pos - 1)
        val = substr($0, pos + 1)

        # Make any regexp metachars in the tag literal 
        gsub(/[^^\\[:alnum:]]/, "[&]", tag)
        gsub(/\\/, "&&", tag)
        gsub(/\^/, "\\\\&", tag)

        tags2vals[tag] = val
        next
    }
    {
        for ( tag in tags2vals ) {
            if ( match($0, tag) ) {
                val = tags2vals[tag]
                $0 = substr($0, 1, RSTART-1) val substr($0, RSTART+RLENGTH)
            }
        }
        print
    }
' props input
$ ./tst.sh
This is a story about world. It starts at a bar and ends in a park.

Bob said to Sally "goodbye, see you soon"

That was run against the sample input you provided:

$ head props input
==> props <==
$foo=bar
$hello=world
^hello=goodbye

==> input <==
This is a story about $hello. It starts at a $foo and ends in a park.

Bob said to Sally "^hello, see you soon"

but if your real input can contain recursive property definitions ($foo=$hello) and/or substrings in the input (this is $foobar here) you do not want to match then you'd need to enhance it to handle those however you want them handled.

See Is it possible to escape regex metacharacters reliably with sed (it's a sed question but the issue of escaping regexp metachars applies to awk too) for what the gsub()s are doing in the script.

9 Comments

this doesn't account for OP's word boundary requirement (\b in their sed script; confirmation in comments), eg, add Leave first 2 matches alone: $foobar $hellow ^hello to the input file; also, this doesn't handle a 'name' showing up more than once in a line, eg, add ^hello $foo $hello ^hello $foo $hello to the input file
@markp-fuso Yup, lots of opportunities for improvement if/when the OP provides requirements and an example we can test all the rainy-day scenarios against. I didn't want to waste time making it so robust it gets complicated and then the OP picks a briefer solution that has most/all of the same issues!
fwiw, running this against a 900-line prop file and a 1500-line input file takes about 3.5 seconds; sure, much better than OP's current code but sub-second total time should be doable; my (somewhat) verbose awk script, using index(), takes about 0.13 seconds for the same 900-/1500-line file setup; I guess this makes sense with match() regex matching requiring more cpu cycles than index() text string matching ...
@markp-fuso I did, yes, as I was trying to "escape" every character to ensure it's literal but it's good to know that not including alnums in bracket expressions creates a performance improvement, thanks, I can't think of any functional issue that might cause so I updated my answer accordingly.
with the latest update the run time for the 900/1500-line test files drops from 3.5 seconds down to 2.5 seconds
1

Hmmm..... you write

INPUT=`cat $INPUT_FILE`  # you start a process here for cat.
while read line
do  
    PROP_NAME=`echo $line | cut -f1 -d'='`  # you start two processes here
    PROP_VALUE=`echo $line | cut -f2- -d'='`   # you start two processes here
    time INPUT="$(echo "$INPUT" | sed "s\`${PROP_NAME}\b\`${PROP_VALUE}\`g")" # two more processes
done <<<$(cat "$PROPERTIES_FILE")  # and one more here.
# Do more stuff with INPUT

You don't specify the format of $INPUT_FILE or $PROPERTIES_FILE, so there is little help I can give you, but I suggest you put everything in a single pipeline of commands, each doing some processing on the whole data set. Something like:

# this command generates substitution commands for all property name and value pairs
# in the form of `-e` commands for _sed(1)_ to make all substitutions in one shot.
# the pairs are in format VARIABLE = value  --->  -e 's"@VARIABLE@"value"g'
SED_PARAMETERS=$(
sed -E                                                                         \
    -e 's"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*)$"-e '\''s\"@\1@\"\2\"g'\''"' \
    < "${PROPERTIES_FILE}" # first group is PROP_NAME, second is PROP_VALUE
)
# then, a sed command is run with these parameters on the input to produce a
# file with the substitutions made
sed -E ${SED_PARAMETERS} < "${INPUT_FILE}" # outputs the transformed file to stdout.
    

This way, only two programs are run:

  1. sed is run on the properties file to generate a set of -e s"@param_name@"param_value"g parameters (according to what is read from the parameters file) to be used in a single run of the input file to change all parameters in one shot.

  2. another sed is run with the parameters above, to change all occurrences of @parameter_name@ to parameter_value. Output is to stdout, so you can chain it and pipe it to another file.

If you put this in a shell script and use stdin to feed the second sed(1) command, then you can make parameter substitution on the fly. I use this approach to put all configuration parameters in a config.mk file, which is parsed to generate the Makefile configurables, the source code configurables and the documentation configurables from a single file.
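
The two-step idea can be sketched as follows. This is a simplified illustration rather than the code from the build system mentioned above: it writes the generated commands to a temporary script file and runs them with sed -f instead of passing -e arguments. It assumes @NAME@ placeholders and values that contain no |, & or backslash. The file contents are made up.

```shell
#!/usr/bin/env bash
props_file=$(mktemp)
input_file=$(mktemp)
script_file=$(mktemp)
printf '%s\n' 'PREFIX = /usr/local' 'VERSION=1.2' > "$props_file"
printf '%s\n' 'install @VERSION@ to @PREFIX@/bin, keep @OTHER@' > "$input_file"

# Pass 1: turn each "NAME = value" line into the command s|@NAME@|value|g.
# -n with the p flag drops lines that are not valid assignments.
sed -E -n \
  's/^[[:space:]]*([A-Za-z_][A-Za-z0-9_]*)[[:space:]]*=[[:space:]]*(.*)$/s|@\1@|\2|g/p' \
  "$props_file" > "$script_file"

# Pass 2: apply all generated substitutions in a single sed run.
result=$(sed -f "$script_file" "$input_file")
printf '%s\n' "$result"   # install 1.2 to /usr/local/bin, keep @OTHER@

rm -f "$props_file" "$input_file" "$script_file"
```

Placeholders with no matching property, such as @OTHER@ here, are left untouched.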

4 Comments

with GNU sed 4.8 this generates sed: -e expression #1, char 2: unknown command: 'f'
@markp-fuso, sorry, but I extracted the code from an actual example used in building programs with make, and as extracted it probably won't work; it is published only as a reference (as is :)). The idea here was to give you a way to introduce pipelines and never use loops that spawn 6 processes per cycle, which is an unnecessary waste of resources (even today). I can work out a complete solution, but as the example code that works is used at work, I cannot publish a complete example. My apologies.
people finding this Q&A who try to use your answer will find it fails, which means this answer isn't really an answer; I don't understand what your work has to do with writing an answer that addresses the OP's question ... ?
@markp-fuso, Sorry, I'll erase it as soon as you have acknowledged reading this comment. I never tried anything but to help, even if it requires you to do some work. I can solve that problem and give you a complete working solution, but I won't. I provided it with the intention to teach, not to solve anyone else's problem. That you insist it is not complete only tells me that you want to see it working so you can use it in your situation. Feel free to edit it and I'll approve the edit if it satisfies me. :)
