I'm having some problems with regex in Perl.

I have a line: #23 = CARTESIAN_POINT ( 'NONE', ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;

And I want to split the line into different values.

Right now I have the regex (#[0-9]+)\s=\s([A-Z]+_[A-Z]+)\s(.*), which gives these values as output:

$array[0]=#23
$array[1]=CARTESIAN_POINT
$array[2]=( 'NONE',  ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;

I want to split the last part, ( 'NONE', ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;, into separate values, like:

PARAM[0] = 'NONE',
PARAM[1] = ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 )

or

PARAM[0] = 'NONE',
PARAM[1] = -1.822612853216911200
PARAM[2] = 55.22284222837789300
PARAM[3] = 8.566382866014988600

But I can't quite figure out how to do it. I tried different things, but none of them is worth mentioning.

I hope someone is able to help me or point me in the right direction. Thanks in advance!

4 Answers

3

This is fairly straightforward when broken into multiple (two) steps.

First, extract the text containing the coordinates, i.e. everything inside CARTESIAN_POINT ( ... ):

my ($coord_text) = $string =~ /= \s+ [A-Z_]+ \s+ \( \s* (.+) \s* \)/x;

where /x allows whitespace inside the pattern, for readability. The .+ is greedy and matches everything up to the very last ), including the nested (...). Then extract the coordinates from that:

my @coords = $coord_text =~ /([A-Z]+|[0-9-.]+)/g;

Here we allow either a word (like NONE) or a number (in the format shown).

Altogether, with the intermediate step "hidden" inside the lexical scope of a do block:

use warnings;
use strict;
use feature 'say';

my $string = q(#23 = CARTESIAN_POINT ( 'NONE', ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ; );

my @coords = do {
    my ($coord_text) = $string =~ /=\s+[A-Z_]+\s+\(\s*(.+)\s*\)/; 
    $coord_text =~ /([A-Z]+|[0-9-.]+)/g;
};

say for @coords; 

This is easily tweaked for variations in requirements or desired output, slight or major:

  • To capture the quotes around NONE as well (as shown in the question), add quotes to the character class for the word: [A-Z\x22\x27]. I use hex escapes in case this ends up in a "one-liner" inside a bash script or some such, since the context isn't specified. In a normal script you can use " and ' directly.

  • To get the numbers as a single string instead of a list, as mentioned in the question, use

    $coord_text =~ /([A-Z]+|\([^)]+\))/g;

    instead of the second statement in the do block above.
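Both variations can be tried side by side. This is only a small sketch built on the same sample line and the same first-step regex as above; the array names @with_quotes and @grouped are made up for the illustration.

```perl
use strict;
use warnings;

my $string = q(#23 = CARTESIAN_POINT ( 'NONE', ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;);

# Step 1 (unchanged): everything between the outermost parentheses.
my ($coord_text) = $string =~ /=\s+[A-Z_]+\s+\(\s*(.+)\s*\)/;

# Variation 1: the word keeps its quotes, the numbers stay separate.
my @with_quotes = $coord_text =~ /([A-Z\x22\x27]+|[0-9-.]+)/g;

# Variation 2: the parenthesised numbers come back as one string.
my @grouped = $coord_text =~ /([A-Z]+|\([^)]+\))/g;

print "with quotes: $_\n" for @with_quotes;
print "grouped:     $_\n" for @grouped;
```

The first list has four elements ('NONE' with its quotes, then three numbers); the second has two (NONE and the whole "( ... )" group), which corresponds to the first layout asked for in the question.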

I assume that you have a list containing either words (like NONE) or straight lists of coordinates (numbers), without any further nesting or similar syntactic complexities.

Note: If the input can be a multiline string, add the /s modifier to the regex. With it, . matches a newline as well, and everything works the same as above (it does in my tests). This should only be needed in the first regex, making it

my ($coord_text) = $string =~ /=\s+[A-Z_]+\s+\(\s*(.+)\s*\)/s;

but it won't hurt in the other one either.
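To see the difference /s makes, here is a minimal sketch with a made-up multiline record (the shortened numbers and the line breaks are assumptions for the example, not data from the question):

```perl
use strict;
use warnings;

# Hypothetical input: the same kind of record, wrapped over three lines.
my $string = "#23 = CARTESIAN_POINT ( 'NONE',\n  ( -1.5, 55.25,\n    8.5 ) ) ;\n";

# Without /s the dot stops at the first newline, so the match fails
# and the variable stays undefined.
my ($without) = $string =~ /=\s+[A-Z_]+\s+\(\s*(.+)\s*\)/;

# With /s the dot also matches newlines and the capture spans all lines.
my ($coord_text) = $string =~ /=\s+[A-Z_]+\s+\(\s*(.+)\s*\)/s;

my @coords = $coord_text =~ /([A-Z]+|[0-9-.]+)/g;

print defined $without ? "matched without /s\n" : "no match without /s\n";
print "$_\n" for @coords;
```

The second regex needs no change here because its character classes never rely on . matching a newline.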


The character class used, [0-9-.], also matches garbage (like -.-2). If you need to confirm that you really have a number in the given format, add a check for that. A reliable way to test for a number is looks_like_number from Scalar::Util.
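A minimal sketch of such a check, assuming the tokens have already been extracted; the token list is made up for the example and deliberately contains the garbage value mentioned above:

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical tokens, as the second regex might return them.
my @tokens = ('NONE', '-1.822612853216911200', '-.-2', '55.22284222837789300');

# Keep only the tokens that Perl itself would accept as numbers.
my @numbers = grep { looks_like_number($_) } @tokens;

for my $token (@tokens) {
    printf "%-24s %s\n", $token,
        looks_like_number($token) ? 'number' : 'not a number';
}
```

Note that looks_like_number accepts anything Perl treats as numeric, so if only the exact "digits, dot, digits" format is acceptable, a stricter anchored regex would be needed on top of it.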

3 Comments

Thanks for the detailed explanation! I’ll try it as soon as I can.
@mHvNG You are most welcome. Note that I edited a little in the meantime, and in particular I just added a "Note" at the end about how to modify it to work with multiline strings.
Thanks for the help. It's really appreciated!
3

This is what Text::Balanced is for.

#!/usr/bin/perl

use strict;
use warnings;

use Text::Balanced qw[extract_bracketed];
use Data::Dumper;

while (<DATA>) {
  # Extract the bit of your string between the first and last brackets
  my $extracted = extract_bracketed($_, '(', '[^()]*');
  # Then split what's left on strings of brackets, whitespace and commas.
  # But grep the list to remove any zero-length strings that you get.
  my @bits = grep { length } split /[\(\)\s,]+/, $extracted;
  print Dumper \@bits;
}

__DATA__
#23 = CARTESIAN_POINT ( 'NONE',  ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;

Output:

$VAR1 = [
          '\'NONE\'',
          '-1.822612853216911200',
          '55.22284222837789300',
          '8.566382866014988600'
        ];

0

You need to repeat the number sub-pattern once for each value you expect and give each one its own capture group:

#[0-9]+\s*=\s*[A-Z]+_[A-Z]+\s*\(\s*'([A-Z]+)',\s*\(\s*(-?\d+\.\d+),\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)

https://regex101.com/r/GJ6yDi/1/
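Used from Perl, a match in list context returns the four captured groups directly. A short sketch with the pattern above applied unchanged to the sample line from the question (the array name @param is just borrowed from the question's PARAM[...] notation):

```perl
use strict;
use warnings;

my $line = q(#23 = CARTESIAN_POINT ( 'NONE', ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;);

# In list context the match returns the capture groups in order.
# The /x modifier is avoided on purpose: the pattern starts with '#',
# which /x would treat as the start of a comment.
my @param = $line =~ /#[0-9]+\s*=\s*[A-Z]+_[A-Z]+\s*\(\s*'([A-Z]+)',\s*\(\s*(-?\d+\.\d+),\s*(-?\d+\.\d+),\s*(-?\d+\.\d+)/;

print "PARAM[$_] = $param[$_]\n" for 0 .. $#param;
```

This only works for lines with exactly one quoted word followed by three numbers; a line of any other shape simply fails to match and leaves @param empty.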

1 Comment

Thanks! I'll try it as soon as I can.
0

If you don't care about the nesting and just want to get all the "values" into an array, you might consider the simpler solution of splitting on runs of the unwanted (non-value) characters: /[(),;=\s]+/

$ cat line
#23 = CARTESIAN_POINT ( 'NONE',  ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;

$ perl -ne '@array = split /[(),;=\s]+/; print join "|", @array; print "\n"' line
#23|CARTESIAN_POINT|'NONE'|-1.822612853216911200|55.22284222837789300|8.566382866014988600
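In a script, the same split can feed a list assignment so that the id and the entity name are peeled off and only the parameters remain. A small sketch; the variable names $id, $type and @param are invented for the example:

```perl
use strict;
use warnings;

my $line = q(#23 = CARTESIAN_POINT ( 'NONE',  ( -1.822612853216911200, 55.22284222837789300, 8.566382866014988600 ) ) ;);

# The first two fields are the id and the entity name; the rest are the
# parameters. split drops trailing empty fields, so the final " ) ) ;"
# does not leave an empty element behind.
my ($id, $type, @param) = split /[(),;=\s]+/, $line;

print "id   = $id\n";
print "type = $type\n";
print "PARAM[$_] = $param[$_]\n" for 0 .. $#param;
```

One caveat: if a line starts with whitespace, split produces an empty leading field, so such lines would need trimming first.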

