Extract Regex Capture Group in Script

Question

I am writing a CSH script and attempting to extract text from a source string given a key.

#!/bin/csh -f
set source = "Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n"
set toFind = "Star Trek"
set regex = "$toFind[\s]*?(.*?)[\s]*?"
set match = `expr $source : $regex`
echo $match

The above code does not work, so I am missing something. I tried placing "Star Trek" directly inside rather than a variable. I should see Renegades as the answer. Had I put "Star Wars" instead of "Star Trek", I should have seen The Empire Strikes Back.

Google search showed a possible solution using grep, such as

match = `grep -Po '<something>' <<< $source`

I did not know what to put for <something>, nor am I an expert in grep.

In the real code, I am reading text from a file. I just simplified things here.

Thoughts?

grep is for matching, sed is able to edit the stream, this is a good introduction: grymoire.com/Unix/Sed.html - also has examples on how to combine with shell scripts including csh. — CAAHS
– CAAHS, Commented Nov 13, 2023 at 21:28
@mandy8055 Your bash script returns "Star Trek" and not "Renegades", so directly as written no. That being said, I am open to a bash solution, though would still leave my original question up, as I am curious if a solution is possible in csh. — Sarah Weinberger
– Sarah Weinberger, Commented Nov 13, 2023 at 22:38
"Thoughts?" ... Your regex looks very much like a Perl regex (but I have no experience with that). So, if that is a Perl regex, you can be sure that unless you have a version of expr that supports Perl regexes, it will never work. BUT now I am reading your initial problem description, "attempting to extract text from a source string given a key." ?? Key/values. Why are you using such an unhelpful solution? Why not key[str]="value" or even just myKey=Renegades? Ah, "I am reading text from a file." It might have helped to have that near the top of your Q. — shellter
– shellter, Commented Nov 14, 2023 at 17:51
Following on, as you say "I just simplified things here." I would rather spend my csh time on converting 2 lines of input into variable assignments, but it seems you have to deal with spaces in your var-names, so nix to Star Trek="Renegade" )-; . Doing quick research, I don't see that csh can do arr[key]="value" arrays, only set arr = (one two three), which are then referenced as echo $arr[1] $arr[3] etc. If you're processing a file with an external utility, sed is good, but awk will give you much more understandable code. Busy now, so that's all I can come up with for the moment. — shellter
– shellter, Commented Nov 14, 2023 at 18:08
Back to the perl-regex thing: there is a small set of special perl-regexp syntax that can be rewritten in long-hand basic regexp. I have to believe that the expr utility only uses basic regexps, but that is not documented in the GNU coreutils 8.30 version of man expr (maybe in info '(coreutils) expr invocation'?). You do know that using csh is shell scripting with one hand tied behind your back? OK as a learning challenge, but for jobs/work, you'll do much better getting good at bash or zsh or something even newer (fish?). (Searching man grep for ERE is the best I can find.) Good luck. — shellter
– shellter, Commented Nov 14, 2023 at 18:16

Sarah Weinberger · Accepted Answer · 2023-11-14 17:44:16Z

The following is not a literal answer to my question, as I asked the question for csh, however I wrote a solution using bash.

Match Regex Capture Groups

Match Whitespace: How can I match spaces with a regexp in Bash?

I used Tutorial Point to debug.

mystring1='  asdf1@wxyz2  @@a!s#d@f@@  asdf2@wxyz2 b!t#e@g '

tofind='asdf1@wxyz2'
regex="${tofind}[[:space:]]*([.!@\#a-zA-Z0-9]+)"

[[ $mystring1 =~ $regex ]]

echo $'\n'
echo $'\n'
echo '***********************'
echo ${BASH_REMATCH[1]}
echo '***********************'
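For comparison, here is a minimal sketch of the same BASH_REMATCH idea applied to the newline+tab layout from the question. The helper name next_value and the variable source_text are made up for this illustration, and the key is interpolated as a regex, so a key containing regex metacharacters would need escaping first.

```bash
#!/usr/bin/env bash
# Sample data in the layout used in the question (newlines and tabs).
source_text=$'Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n'

# next_value KEY TEXT (hypothetical helper, not part of the original answer):
# prints the line that follows the line consisting only of KEY,
# with leading whitespace removed. Returns 1 if KEY is not found.
next_value() {
    local key=$1 text=$2 nl=$'\n'
    # (^|newline) + optional indentation + key + end of that line,
    # then optional indentation and the captured remainder of the next line.
    local regex="(^|${nl})[[:space:]]*${key}[[:space:]]*${nl}[[:space:]]*([^${nl}]*)"
    [[ $text =~ $regex ]] || return 1
    # Group 1 is the (^|newline) anchor, so the value is group 2.
    printf '%s\n' "${BASH_REMATCH[2]}"
}

next_value "Star Trek" "$source_text"   # Renegades
next_value "Star Wars" "$source_text"   # The Empire Strikes Back
```

Because the key has to fill its whole line, a partial key such as "Star" matches nothing here, which may or may not be what you want.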

edited Nov 14, 2023 at 17:44

answered Nov 14, 2023 at 16:09

Sarah Weinberger

1 Comment

Ed Morton Over a year ago

mystring1=' asdf1@wxyz2 @@a!s#d@f@@ asdf2@wxyz2 b!t#e@g ' is not in the same newlines+tabs separated format as the text in your question, set source = "Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n". This is a possible answer to a different question than the one you asked.

Sarah Weinberger · Accepted Answer · 2023-11-14 22:30:29Z

The real solution reads the source from a file, so it is:

set valueCapture=`cat /mypath/filename | grep -A1 "${tofind}" | grep -v "${tofind}" | xargs`

The code to find a capture value from a string should be (did not test it):

set valueCapture=`cat $source | grep -A1 "${tofind}" | grep -v "${tofind}" | xargs`

In both cases, what I wish to find is:

set tofind='asdf1@wxyz2'

The xargs part trims off whitespace.

answered Nov 14, 2023 at 22:30

Sarah Weinberger

2 Comments

Ed Morton Over a year ago

That's doing partial regexp matching across whole lines when you almost certainly should be doing whole-line or whole-field string matching, and it'd fail if the same target string appeared in both lines.

Paul Hodges Over a year ago

Also, UUoC. Drop the cat file and just grep -A1 "${tofind}" file. From a string you might use echo "$source", but not cat.

Ed Morton · Accepted Answer · 2023-11-15 20:17:41Z

Since you said your real input is in a file, here's the file your printf outputs:

$ cat file
Smurfs
        Papa
Star Trek
        Renegades
        Star Wars
        The Empire Strikes Back

and here's how to match and print the strings you want from it:

$ awk -v tgt='Star Trek' '{gsub(/^[[:space:]]+|[[:space:]]+$/,"")} $0==tgt{n=NR+1} NR==n' file
Renegades

$ awk -v tgt='Star Wars' '{gsub(/^[[:space:]]+|[[:space:]]+$/,"")} $0==tgt{n=NR+1} NR==n' file
The Empire Strikes Back
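In case the one-liner is hard to read, here is the same awk logic spread out with comments and wrapped in a shell function. The function name next_line_after and the temporary demo file are only assumptions for this sketch; the awk program itself is unchanged.

```bash
# next_line_after KEY FILE (hypothetical wrapper around the awk above):
# prints the trimmed line that follows the line equal to KEY.
next_line_after() {
    awk -v tgt="$1" '
        # Strip leading and trailing whitespace from every input line.
        { gsub(/^[[:space:]]+|[[:space:]]+$/, "") }
        # Whole-line string comparison (not a regex): note the next line number.
        $0 == tgt { n = NR + 1 }
        # Print the line whose number was noted; n is unset until a match.
        NR == n
    ' "$2"
}

# Demo against a temporary copy of the sample data.
demo_file=$(mktemp)
printf 'Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n' > "$demo_file"
next_line_after "Star Trek" "$demo_file"   # Renegades
next_line_after "Star Wars" "$demo_file"   # The Empire Strikes Back
rm -f "$demo_file"
```

Since the comparison is $0 == tgt, a key like "Star" prints nothing, and characters that are special in regexes need no escaping (although awk -v still interprets backslash escapes in the value).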

See why-is-using-a-shell-loop-to-process-text-considered-bad-practice.

answered Nov 15, 2023 at 20:17

Ed Morton

Paul Hodges · Accepted Answer · 2023-11-17 21:28:57Z

A pipeline can do it, though it isn't as good as Ed's single process awk.

$: toFind="Star Wars"; echo "$source" |  grep -EA1 "$toFind" | tail -1
        The Empire Strikes Back

$: toFind="Star Trek"; echo "$source" |  grep -EA1 "$toFind" | tail -1
        Renegades

$: echo "$source">file; toFind="Star Trek"; grep -EA1 "$toFind" file | tail -1
        Renegades

A sed would work.

$: toFind="Star Trek"; sed -n "/$toFind/{n
                                         p}" file # should work with any version
        Renegades

$: toFind="Star Wars"; sed -n "/$toFind/{n;p}" file # semicolon is GNU
        The Empire Strikes Back

With any of these, it is probably worth refining your regex.

$: toFind="Star"; sed -n "/$toFind/{n;p}" file
        Renegades
        The Empire Strikes Back

$: toFind="Star"; sed -n "/^$toFind$/{n;p}" file

$: toFind="Star Trek"; sed -n "/^$toFind$/{n;p}" file
        Renegades

$: toFind="Star Wars"; sed -n "/^$toFind$/{n;p}" file # fails because of the leading tab

That last failure might mean you have to fall back to the looser, unanchored match from the first example.
Test your logic.
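
One possible middle ground, sketched here under the assumption that keys may be indented by tabs or spaces: anchor the pattern but allow optional whitespace on either side of the key, and trim the printed line. The function name next_line_sed is made up for this example, and the key is still used as a basic regex, so a key containing / or regex metacharacters would need escaping.

```bash
# next_line_sed KEY FILE (hypothetical helper):
# prints the line after the one consisting only of KEY, ignoring indentation.
next_line_sed() {
    # Commands are separated by newlines rather than semicolons so that
    # this does not depend on GNU sed.
    sed -n "/^[[:space:]]*$1[[:space:]]*\$/{
        n
        s/^[[:space:]]*//
        s/[[:space:]]*\$//
        p
    }" "$2"
}

# Demo against a temporary copy of the sample data.
demo_file=$(mktemp)
printf 'Smurfs\n\tPapa\nStar Trek\n\tRenegades\n\tStar Wars\n\tThe Empire Strikes Back\n' > "$demo_file"
next_line_sed "Star Trek" "$demo_file"   # Renegades
next_line_sed "Star Wars" "$demo_file"   # The Empire Strikes Back
next_line_sed "Star" "$demo_file"        # prints nothing
rm -f "$demo_file"
```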
