
The parser API (which I am not allowed to modify) gives me a string of this form:

    var1    var2  \
var2continued var2continued   \\\
var2continued
var3
var3continued \

var3continued

I want to split this string with a regex so that:

$1 = "var1";
$2 = "var2  
var2continued var2continued   \\
var2continued"
$3 = "var3
var3continued \

var3continued"

Basically, the first variable is the first non-space word: it starts after one or more spaces and ends at the next space.

The second variable starts at the first non-space character after the first variable and runs to the end of the line. If the last character on the line is "\", the next line is added to the second variable (the white space between the last character on the current line and the "\" is not trimmed). A doubled backslash ("\\") should not capture the next line; both backslashes are returned (no escape). Only the last line has its trailing white space trimmed.

The third variable is everything after the second variable.

So far I've come up with this regex, which only works when var2 and var3 each fit on a single line:

$my_re = qr/\s+(\S+)\s+(\S+)\s+[\n](.*)/;

$text =~ /$my_re/;

1
  • Given the comments under the answers, the description in the question omits important facts. (It is also a little hard to follow.) I suggest being extra careful when formulating questions, as this text is the only thing we have to go on. Please recall that people who read this generally have absolutely no clue about your problem. Commented Jan 8, 2020 at 3:50

3 Answers

3

The first word, then everything up to a newline that is immediately preceded by a non-backslash character, then all the rest:

/\s+ (\S+) \s+ (.*?[^\\]) \n (.*)/xs;

The /s modifier makes . match a newline as well (normally it doesn't), which is critical here. The /x modifier makes the pattern ignore literal whitespace, so we can space it out for readability.
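To see the two modifiers in isolation, here is a minimal sketch on a made-up two-line string (the variable names are only for illustration and are not part of the answer's program):

```perl
use strict;
use warnings;

my $text = "first line\nsecond line";

# Without /s, '.' stops at the newline, so only the first line is captured.
my ($no_s) = $text =~ /(.*)/;

# With /s, '.' also matches "\n", so the capture spans both lines.
my ($with_s) = $text =~ /(.*)/s;

# With /x, whitespace inside the pattern is ignored; a real space has to be
# written as \s (or inside a character class).
my ($word) = $text =~ / (\S+) \s /x;

print "no /s:   [$no_s]\n";
print "with /s: [$with_s]\n";
print "with /x: [$word]\n";
```

Because /x drops the spaces in the pattern, the \s tokens in the answer's regex are what actually match the separators between the fields.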


An example program

use warnings;
use strict;
use feature 'say';

my $v = 
q(    var1    var2  \
var2continued var2continued   \\\
var2continued
var3
var3continued \

var3continued);

$v =~ /\s+ (\S+) \s+ (.*?[^\\]) \n (.*)/xs;

say "\"$1\"";  say '---';
say "\"$2\"";  say '---';
say "\"$3\""; 

prints

"var1"
---
"var2  \
var2continued var2continued   \\
var2continued"
---
"var3
var3continued \

var3continued"
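The pattern above treats any backslash directly before a newline as a continuation. If only an odd run of backslashes should continue the second field, and the continuation backslash itself should be dropped (as the comments on this answer suggest), a lookbehind can check the parity. The following is a sketch under those assumptions, not part of the original answer; split_fields is a hypothetical helper name, and the sample is read raw from a single-quoted heredoc, so its three backslashes stay three.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (field1, field2, rest), or an empty list
# when the text does not match.
# Assumption: a newline continues field 2 only when it is preceded by an
# ODD number of backslashes.
sub split_fields {
    my ($text) = @_;
    my @fields = $text =~ /
        \s* (\S+) \s+                  # field 1: first non-space word
        ( .*? (?<!\\) (?: \\\\ )* )    # field 2: ends after an even run of backslashes
        \n
        ( .* )                         # field 3: everything else
    /xs or return;

    # Drop the single continuation backslash from each odd run in field 2.
    $fields[1] =~ s/(?<!\\)((?:\\\\)*)\\\n/$1\n/g;
    return @fields;
}

# Single-quoted heredoc: backslashes are kept exactly as typed.
my $sample = <<'END';
    var1    var2  \
var2continued var2continued   \\\
var2continued
var3
var3continued \

var3continued
END

my ($f1, $f2, $f3) = split_fields($sample);
print "[$f1]\n---\n[$f2]\n---\n[$f3]\n";
```

The match and the clean-up are kept as two steps because a capture group cannot skip characters inside the text it captures; the substitution afterwards removes the continuation backslashes.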

7 Comments

Your backslashes are escaping the next backslash. The \\\ is actually \\ when printed. My string is the text after escaping. Also, I need to delete one backslash from $2 when the count is odd. From googling, I believe you can't skip characters in captured groups, so I had to loop through each line.
@Topa "backslash are escaping the next backslash" -- hum? Do you mean the [^\\] ...? Yes, it's escaping it so that we get a "not-backslash" character class. You can't simply place a backslash in a character class (like [\]) because it would actually escape the ], making it into a syntax error (because then the opening [ wouldn't get closed)
@Topa "My string is actually after escaping." -- heh ... your string is exactly what you showed in the question. People who read a question can't guess what you mean, and that "after escaping" thing isn't mentioned in the question (and I don't really know what you mean by that?)
@Topa "Also I need to delete one backslash if odd ..." -- again, you never mention that in the question. (The output you show indeed has two backslashes, and my answer reproduces that.)
@Topa "can't skip characters in captured groups" --- I don't understand what you mean by that. The regex in the answer works (and implements your description)
1

Try the following piece of code (my take on the problem):

use strict;
use warnings;

my $str = do { local $/; <DATA> };

print "INPUT:\n[$str]\n";

$str =~ /(\w+)\s+(.*?\\\\\\\s*\w+)\n(.+)/s;
#$str =~ /(\w+)\s+((?:.*?)\\\\\\\s+(?:\w+)?)\n(.+)/s;

print "\n1: [$1]";
print "\n2: [$2]";
print "\n3: [$3]";

__DATA__
    var1    var2  \
var2continued var2continued   \\\
var2continued
var3
var3continued \

var3continued

output

INPUT:
[    var1    var2  \
var2continued var2continued   \\\
var2continued
var3
var3continued \

var3continued
]

1: [var1]
2: [var2  \
var2continued var2continued   \\\
var2continued]
3: [var3
var3continued \

var3continued
]

1 Comment

It doesn't delete one backslash when the number of backslashes is odd. I believe that is not doable in one regex.
0

None of the answers worked for all cases (2 and 3 are optional). I also had a small issue where the parser was adding a space after the backslash.

I ended up splitting the text into an array of lines, then dividing it into two parts (1 and 2 together, and 3 by itself), and finally splitting the first part on its own. My actual code is spread over multiple functions, but I simplified it below:

my $empty_re      = qr/^\s*$/;
my $def_re        = qr/(.*?)((?:\\{2})*)(\\?)\s*$/;
my $dual_token_re = qr/\s*(\S+)\s*(.*)/s;
my $text = "place text here";
my @lines = split /\n/, $text;
my $i;
my $j;
my $def = "";
my $other;
# Skip leading empty lines to find where the capture starts
for ($i = 0; $i <= $#lines; $i++) {
    last if $lines[$i] !~ /$empty_re/;
}

# Start definition capture
for ($j = $i; $j <= $#lines; $j++) {
    $lines[$j] =~ s/$def_re/$1$2/; # remove the trailing backslash if the count is odd
    last if !$3; # stop at an even number of backslashes
}
$j = $#lines if $j > $#lines; # the last line ended with a continuation
$def = join "\n", @lines[$i..$j];
$j++;

# Get remaining text
if ($j <= $#lines) {
    $other = join "\n", (splice @lines, $j);
}

# $def has 1 and 2, $other has 3

$def =~ /$dual_token_re/;
# now $1 and $2 hold 1 and 2, $other has 3

1 Comment

"... cases (2 and 3 are optional)" --- but the question doesn't give even a hint of that? On the contrary, it specifically makes statements about groups 2 and 3.
