
I am trying to read a large text file containing variable names and their corresponding values (see below for a small example). The names are all upper case, and each value is usually separated from its name by periods and whitespace; if the variable name is too long, it is separated by whitespace only.

WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN  

TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG 

PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M  

I am able to find the values using the following expression:

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'
re.findall(r"[-+]?\d*\.\d+|\d+", line)
['166.74', '1.72']

But when I try to extract the variable names using the expression below, I get leading and trailing whitespace, which I would like to leave out.

re.findall(r'(?<=\s.)[A-Z\s]+', line)
[' PROJECTED SPAN LENGTH     ', '      PIPE LENGTH GAIN ', '    ', '   \n']

I believe it should have something like ^\s, but I can't get it to work. Once that works, I'd like to store the data in a dataframe, with the variable names as the index and the values as a column.

    Use r'[A-Z]+(?:\s+[A-Z]+)*' Commented Aug 23, 2016 at 14:16

3 Answers


You can use the following expression along with re.finditer():

(?P<category>[A-Z][A-Z- ]+[A-Z])
[. ]+
(?P<value>-?\d[.\d]+)\ 
(?P<unit>M|DEG|KN)

See a demo on regex101.com.


In Python this would be:

import re

rx = re.compile(r'''
    (?P<category>[A-Z][A-Z- ]+[A-Z])
    [. ]+
    (?P<value>-?\d[.\d]+)\ 
    (?P<unit>M|DEG|KN)
''', re.VERBOSE)

string = '''
WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN  

TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG 

PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M  
'''

matches = [(m.group('category'), m.group('value'), m.group('unit'))
           for m in rx.finditer(string)]
print(matches)
# [('WATER DEPTH', '20.00', 'M'), ('TENSION AT TOUCHDOWN', '382.47', 'KN'), ('TOUCHDOWN X-COORD', '-206.75', 'M'), ('BOTTOM SLOPE ANGLE', '0.000', 'DEG'), ('PROJECTED SPAN LENGTH', '166.74', 'M'), ('PIPE LENGTH GAIN', '1.72', 'M')]

See a demo on ideone.com.
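
Since the question asks for the names as an index with the values as a column, here is a minimal sketch of one way to get there from the matches, assuming the same pattern as above. It only builds plain dictionaries (the names `values` and `units` are mine, not part of the answer); passing `values` to `pandas.Series` would then give a series indexed by variable name.

```python
import re

# Same pattern as in the answer above.
rx = re.compile(r'''
    (?P<category>[A-Z][A-Z- ]+[A-Z])
    [. ]+
    (?P<value>-?\d[.\d]+)
    \ (?P<unit>M|DEG|KN)
''', re.VERBOSE)

string = (
    'WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN\n'
    'TOUCHDOWN X-COORD. ...   -206.75 M      BOTTOM SLOPE ANGLE ...     0.000 DEG\n'
)

# Map each variable name to its numeric value and to its unit.
values = {}
units = {}
for m in rx.finditer(string):
    values[m.group('category')] = float(m.group('value'))
    units[m.group('category')] = m.group('unit')

print(values)
print(units)
```

Converting the value with `float()` at this point means the numbers are ready for arithmetic later; keep the strings instead if the original formatting matters.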


3 Comments

Thanks Jan, this is a very neat solution, and regex101.com is also pretty handy. I have taken the liberty of posing another question to you; the problem is at regex101.com/r/nK3hN6/1. In my previous question I had only posted a section of the text to analyze, but there are some more lines that I also have difficulty with, lines with no units for example. Thanks in advance.
@EmielT: regex101.com/r/nK3hN6/2 (make the last group optional and put the longest alternatives first).
Perfect, many thanks! It has given me at least a bit more insight in regex.
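
For readers who cannot open the regex101 link, the advice in the comment above (optional last group, longest alternatives first) might look roughly like the sketch below. This is not the linked pattern itself, only an illustration of the idea. The SAFETY FACTOR entry is a made-up example of a line without a unit, and MPA is added to the unit list only to show why ordering matters: with M listed first and nothing anchoring the end of the unit, MPA would be read as M.

```python
import re

rx = re.compile(r'''
    (?P<category>[A-Z][A-Z- ]+[A-Z])   # variable name
    [. ]+                              # filler of periods and spaces
    (?P<value>-?\d[.\d]+)              # numeric value
    (?:\ (?P<unit>MPA|DEG|KN|M))?      # optional unit, longest alternatives first
''', re.VERBOSE)

# Hypothetical sample: the SAFETY FACTOR entry has no unit.
text = (
    'YIELD STRESS ......... 241.00 MPA      SAFETY FACTOR ........   1.50\n'
    'WATER DEPTH ..........     20.00 M      TENSION AT TOUCHDOWN .    382.47 KN\n'
)

# The unit group is None when no unit follows the value.
rows = [(m.group('category'), m.group('value'), m.group('unit'))
        for m in rx.finditer(text)]
for row in rows:
    print(row)
```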

If you ever want to take out leading/trailing white space, you can use the .strip() method.

Python String strip

stripped_values = [raw.strip() for raw in re.findall(r'(?<=\s.)[A-Z\s]+', line)]
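
Note that the question's pattern also returns matches made up of whitespace only, which become empty strings once stripped. A small sketch of one way to drop them, reusing the sample line from the question (the name `raw_matches` is mine):

```python
import re

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'

raw_matches = re.findall(r'(?<=\s.)[A-Z\s]+', line)

# Strip each match, then keep only those that are non-empty afterwards.
names = [raw.strip() for raw in raw_matches if raw.strip()]
print(names)  # ['PROJECTED SPAN LENGTH', 'PIPE LENGTH GAIN']
```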


Use [A-Z]{2,}(?:\s+[A-Z]+)*

[A-Z]{2,} matches a run of at least two uppercase letters, so a single-letter unit such as M is skipped

(?:\s+[A-Z]+)* is a non-capturing group that picks up any further words in the label
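
As a quick check, here is how that pattern behaves on two of the sample lines from the question. The second call shows the split on X-COORD that the edit below tries to address. This is only a sketch; `name_rx` is my own variable name.

```python
import re

name_rx = re.compile(r'[A-Z]{2,}(?:\s+[A-Z]+)*')

line = '   PROJECTED SPAN LENGTH     166.74 M      PIPE LENGTH GAIN .....      1.72 M   \n'
names = name_rx.findall(line)
print(names)  # ['PROJECTED SPAN LENGTH', 'PIPE LENGTH GAIN']

# The hyphen is not in the character class, so this label is split in two.
split_names = name_rx.findall('TOUCHDOWN X-COORD. ...   -206.75 M')
print(split_names)  # ['TOUCHDOWN X', 'COORD']
```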

EDIT

To handle the case in your comment I'd recommend:

[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*

Just make sure there is at least one space after the last period in R.O.W. and before the ... filler.

[A-Z-\/]{2,} matches a run of at least two characters made up of uppercase letters, - and /

(?:\s*[A-Z-\/]+(?:\.)*)* is a non-capturing group for multiple words and/or words with periods in them
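
A short sketch of the edited pattern on single-entry strings, taken from the question and the comment below. Two things to be aware of: a trailing period that belongs to the label is kept in the match, and any unit of two or more letters (DEG here) is matched as well, so some post-filtering would likely be needed. The variable names are mine.

```python
import re

label_rx = re.compile(r'[A-Z-\/]{2,}(?:\s*[A-Z-\/]+(?:\.)*)*')

row_offset = label_rx.findall('OFFSET FROM R.O.W. ... 0.00 M')
print(row_offset)  # ['OFFSET FROM R.O.W.']

row_coord = label_rx.findall('TOUCHDOWN X-COORD. ...   -206.75 M')
print(row_coord)  # ['TOUCHDOWN X-COORD.']

# Caveat: multi-letter units are picked up as if they were labels.
row_heading = label_rx.findall('BARGE HEADING ........ 0.000 DEG')
print(row_heading)  # ['BARGE HEADING', 'DEG']
```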

1 Comment

Thanks depperm, this works rather well. However, for the second-to-last row, TOUCHDOWN X-COORD. is being split into TOUCHDOWN and COORD. OK, this can be fixed by escaping the character in the non-capturing group. However, in the text file the following may also occur: WEIGHT/LENGTH IN AIR . 1301.00 N/M YIELD STRESS ......... 241.00 MPA or BARGE HEADING ........ 0.000 DEG OFFSET FROM R.O.W. ... 0.00 M. Here R.O.W., for example, is not found, which I believe can be caught by using a lookbehind/lookahead statement. Could you please advise on how to implement this as well? Thanks
