Extract variable names and values using REGEX in Python from a text file

Question

I am trying to read a large text file containing variable names and their corresponding values (see the small example below). Names are all upper case, and the value is usually separated from the name by periods and whitespace; if the variable name is too long, it is separated by whitespace only.

WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN  

TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG 

PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M

I am able to find the values using the following expression:

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'
re.findall(r"[-+]?\d*\.\d+|\d+", line)
['166.74', '1.72']

But when I try to extract the variable names using the expression below, I get leading and trailing whitespace, which I would like to leave out.

re.findall(r'(?<=\s.)[A-Z\s]+', line)
[' PROJECTED SPAN LENGTH     ', '      PIPE LENGTH GAIN ', '    ', '   \n']

I believe it needs something like ^\s, but I can't get it to work. Once this works, I'd like to store the data in a dataframe, with the variable names as the index and the values as a column.

Use r'[A-Z]+(?:\s+[A-Z]+)*'

– Wiktor Stribiżew, Commented Aug 23, 2016 at 14:16

Jan · Accepted Answer · 2016-08-23 17:29:32Z

1

You can use the following expression along with re.finditer():

(?P<category>[A-Z][A-Z- ]+[A-Z])
[. ]+
(?P<value>-?\d[.\d]+)\ 
(?P<unit>M|DEG|KN)

See a demo on regex101.com.

In Python this would be:

import re

rx = re.compile(r'''
    (?P<category>[A-Z][A-Z- ]+[A-Z])
    [. ]+
    (?P<value>-?\d[.\d]+)\ 
    (?P<unit>M|DEG|KN)
''', re.VERBOSE)

string = '''
WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN  

TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG 

PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M  
'''

matches = [(m.group('category'), m.group('value'), m.group('unit'))
           for m in rx.finditer(string)]
print(matches)
# [('WATER DEPTH', '20.00', 'M'), ('TENSION AT TOUCHDOWN', '382.47', 'KN'), ('TOUCHDOWN X-COORD', '-206.75', 'M'), ('BOTTOM SLOPE ANGLE', '0.000', 'DEG'), ('PROJECTED SPAN LENGTH', '166.74', 'M'), ('PIPE LENGTH GAIN', '1.72', 'M')]

See a demo on ideone.com.
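Since the question asks for the values to end up keyed by variable name, a minimal sketch of that next step could look like the following. It reuses the pattern above unchanged; the `records` name and the conversion to `float` are assumptions made for illustration and are not part of the original answer.

```python
import re

# Same pattern as in the answer, with a comment added for each part.
rx = re.compile(r'''
    (?P<category>[A-Z][A-Z- ]+[A-Z])   # upper-case name; may contain spaces and hyphens
    [. ]+                              # filler dots and spaces
    (?P<value>-?\d[.\d]+)\             # number, followed by one escaped space
    (?P<unit>M|DEG|KN)                 # one of the known units
''', re.VERBOSE)

text = '''
WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN

TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG

PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M
'''

# Map each variable name to a (numeric value, unit) pair.
# 'records' is a hypothetical name chosen for this sketch.
records = {
    m.group('category'): (float(m.group('value')), m.group('unit'))
    for m in rx.finditer(text)
}

for name, (value, unit) in records.items():
    print(name, value, unit)
```

If pandas is installed, passing this dictionary to `pandas.DataFrame.from_dict(records, orient='index', columns=['value', 'unit'])` should give a frame indexed by variable name, which is the layout the question describes.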

edited Aug 23, 2016 at 17:29

answered Aug 23, 2016 at 15:01

Jan

3 Comments

EmielT Over a year ago

Thanks Jan, this is a very neat solution, and regex101.com is also pretty handy. I have taken the liberty of posing another question to you; here is the link to the problem: regex101.com/r/nK3hN6/1. In my previous question I posted only a section of the text to analyze, but there are more lines that I also have some difficulty with, for example lines with no units. Thanks in advance.

Jan Over a year ago

@EmielT: regex101.com/r/nK3hN6/2 (make the last group optional and put the longest alternatives first).

EmielT Over a year ago

Perfect, many thanks! It has given me at least a bit more insight into regex.
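The advice in the comment above (optional last group, longest alternatives first) can be sketched roughly as follows. This is not the pattern behind the regex101 link, which is not reproduced here; the extended unit list and the unit-less `NODE COUNT` entry are assumptions made up for illustration.

```python
import re

# Alternatives are tried left to right, so a short one listed first wins
# even where a longer one would also match.
short_first = re.match(r'M|MPA', 'MPA').group()
long_first = re.match(r'MPA|M', 'MPA').group()
print(short_first, long_first)

rx = re.compile(r'''
    (?P<category>[A-Z][A-Z- ]+[A-Z])
    [. ]+
    (?P<value>-?\d[.\d]+)
    (?:\ (?P<unit>MPA|DEG|KN|M))?      # optional unit, longest alternatives first
''', re.VERBOSE)

# 'NODE COUNT' is a hypothetical entry without a unit, used only for illustration.
sample = 'YIELD STRESS .........    241.00 MPA      NODE COUNT ...........     12.00'

rows = [(m.group('category'), m.group('value'), m.group('unit'))
        for m in rx.finditer(sample)]
print(rows)
```

When the optional group does not take part in a match, `m.group('unit')` returns `None`, so unit-less lines can be told apart afterwards.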

gregbert · Accepted Answer · 2016-08-23 14:24:44Z

0

If you ever want to take out leading/trailing white space, you can use the .strip() method.

Python String strip

stripped_values = [raw.strip() for raw in re.findall(r'(?<=\s.)[A-Z\s]+', line)]
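One thing to watch for: the expression from the question also returns matches that consist only of whitespace, and these become empty strings after stripping. A small sketch that drops them as well (the `names` variable is just an illustrative name):

```python
import re

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'

raw_matches = re.findall(r'(?<=\s.)[A-Z\s]+', line)

# Strip each match and keep only the ones that are not empty afterwards.
names = [raw.strip() for raw in raw_matches if raw.strip()]
print(names)  # ['PROJECTED SPAN LENGTH', 'PIPE LENGTH GAIN']
```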

answered Aug 23, 2016 at 14:24

gregbert

depperm · Accepted Answer · 2016-08-24 13:51:58Z

0

Use [A-Z]{2,}(?:\s+[A-Z]+)*

[A-Z]{2,} matches an uppercase word at least 2 characters long

(?:\s+[A-Z]+)* is a non-capturing group that matches any further words in the label
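As a quick check, here is the pattern applied to two of the sample lines from the question. Treat it as a sketch; the `results` name is only illustrative.

```python
import re

pattern = re.compile(r'[A-Z]{2,}(?:\s+[A-Z]+)*')

lines = [
    '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   ',
    'WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN  ',
]

# No capturing groups are present, so findall() returns the full matches.
results = [pattern.findall(line) for line in lines]
for found in results:
    print(found)
# ['PROJECTED SPAN LENGTH', 'PIPE LENGTH GAIN']
# ['WATER DEPTH', 'TENSION AT TOUCHDOWN', 'KN']
```

The {2,} requirement is what keeps the single-letter unit M out of the results. Units of two or more letters, such as KN, still satisfy the pattern, so they may need to be filtered out afterwards, for example against a set of known units.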

EDIT

To handle the case in your comment I'd recommend:

[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*

Just make sure there is at least one space after the last period in R.O.W. and before the ...

[A-Z-\/]{2,} matches a run of at least 2 characters made up of uppercase letters, - and /

(?:\s*[A-Z-\/]+(?:\.)*)* is a non-capturing group that handles multiple words and/or words with periods in them
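A short sketch of the extended pattern on the two names mentioned in the comment, each taken as a single name/value pair (the `found` name is only illustrative):

```python
import re

pattern = re.compile(r'[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*')

samples = [
    'TOUCHDOWN X-COORD. ...   -206.75 M',
    'OFFSET FROM R.O.W. ...      0.00 M',
]

# Collect the full matches for each sample; the trailing period stays in the name.
found = [pattern.findall(s) for s in samples]
for names in found:
    print(names)
# ['TOUCHDOWN X-COORD.']
# ['OFFSET FROM R.O.W.']
```

On full two-column lines, a unit of two or more letters (DEG, MPA, N/M) sitting directly before the next name can be absorbed into that name, because \s* lets the match run across the gap. Splitting each line into its two columns first, or matching name, value and unit together, would avoid that.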

edited Aug 24, 2016 at 13:51

answered Aug 23, 2016 at 14:24

depperm

1 Comment

EmielT Over a year ago

Thanks depperm, this works rather well. However, in the second-to-last row TOUCHDOWN X-COORD. is split into TOUCHDOWN and COORD. OK, this can be fixed by escaping the character in the non-capturing group. However, the text file may also contain lines such as: WEIGHT/LENGTH IN AIR . 1301.00 N/M YIELD STRESS ......... 241.00 MPA or BARGE HEADING ........ 0.000 DEG OFFSET FROM R.O.W. ... 0.00 M. Here R.O.W., for example, is not found, which I believe could be handled with a lookbehind/lookahead. Could you please advise on how to implement this as well? Thanks
