How to get a string and its value with regex

Question


Name        Miss deks KUMARI                    Booking Date           22/05/2020 
             Gender/Age  male  24 Yrs                        Reporting Date         22/05/2020 
             Lab No.     10203693                              Sample Collected At    Lab 
             Ref. By Dr. I.C.U 
                  ;                                                                          UVLO 
             Test Name                                  Value         Unit            Biological Ref Interval 
                                           COMPLETE   BLOOD   COUNT (CBC) 
             TOTAL LEUCOCYTES    COUNT (TLC)            23160         cells/cmm       4000 - 11000 
             DIFFERENTIAL LEUCOCYTES  COUNT (DLC) 
             NEUTROPHILS                                93.4          %               45.0 - 65.0 
             LYMPHOCYTES                                 3.3          %               20.0 - 45.0 
             MONOCYTES                                   3.1          %               4.0 - 10.0 
             EOSINOPHILS                                0.2           %               0.0 - 5.0 
             BASOPHILS                                   0.0          %               0.0-1.0 
             ABSOLUTE   NEUTROPHILS                      21620.0                      3000.0 - 7000.0 
             ABSOLUTE   LYMPHOCYTES                      750.0                        800.0 - 4000.0 
             ABSOLUTE  MONOCYTES                         730.0                        0.0 - 1200.0 
             ABSOLUTE  EOSINOPHILS                       50.0                         0.0 - 500.0 
             ABSOLUTE  BASOPHILS                         10.0                         0.0 - 100.0 
             RBC  COUNT                                  4.31         Millions/cmm    3.80 - 5.80

This is a text file, and I want to produce the following kind of output using regex.

If I search for NEUTROPHILS, I want its value, 93.4.

If I search for BASOPHILS, I want its value, 0.0, and so on.

Only the first two columns are needed. I tried this regex: ^[^\S\r\n]*(\S+)[^\d\r\n]+(\d+(?:\.\d+)?)[^\d\r\n]*(\d+(?:\.\d+)?)?

but it returns everything.

Could someone please help me with this?

Here is my list:

         ["NEUTROPHILS",
          "LYMPHOCYTES",
          "MONOCYTES",
          "EOSINOPHILS",
          "BASOPHILS"]

I want to get output like this:

{
 "NEUTROPHILS"  :  93.4,
 "LYMPHOCYTES"  :  3.3,
 "MONOCYTES"    :  3.1,
 "EOSINOPHILS"  :  0.2,
 "BASOPHILS"    :  0.0
}

There are a number of ways to do this. What I've done in the past is go through the file line by line, use a regex to find the line I need (if you use re.search, use the match's .string attribute to get the entire line), call .split() on that string, then index the value you want to extract.
– samman, Commented Jul 13, 2020 at 17:43

Jan · Accepted Answer · 2020-07-13 19:16:44Z

3

You could use the following expression:

\b(?P<key>[A-Z][A-Z ]+)\b(?P<value>\d+(?:\.\d+)?)

Then we need to clean the keys (collapse the extra whitespace) and write a function that returns the value for a given key. Optionally, put it all in a class. The code could look like this:

import re

class Finder:
    def __init__(self, haystack):
        self.db = self.build_db(haystack)

    def build_db(self, haystack):
        # key: a run of uppercase letters and spaces; value: the number right after it
        rx = re.compile(r'\b(?P<key>[A-Z][A-Z ]+)\b(?P<value>\d+(?:\.\d+)?)')
        ws = re.compile(r'\s+')

        # collapse repeated whitespace in each key, e.g. "RBC  COUNT" -> "RBC COUNT"
        return {ws.sub(' ', m["key"].strip()): m["value"] for m in rx.finditer(haystack)}

    def find_by_key(self, key):
        # returns None when the key is not present
        return self.db.get(key)

    def get_selected(self, lst):
        result = {}
        for key in lst:
            value = self.find_by_key(key)
            if value is not None:
                result[key] = value
        return result

    def get_all(self):
        return self.db

# junk must hold the whole report as a single string (the contents of the text file)
cls = Finder(junk)
dct = cls.get_selected(["NEUTROPHILS", "LYMPHOCYTES", "MONOCYTES", "EOSINOPHILS", "BASOPHILS"])
print(dct)

Which would yield

{'NEUTROPHILS': '93.4', 'LYMPHOCYTES': '3.3', 
 'MONOCYTES': '3.1', 'EOSINOPHILS': '0.2', 'BASOPHILS': '0.0'}

See a demo for the expression on regex101.com.

edited Jul 13, 2020 at 19:16

answered Jul 13, 2020 at 17:44

Jan

3 Comments

Jose Over a year ago

Thanks for this. I just updated my question; can you please help me with that?

Jan Over a year ago

@Jose: See the updated answer and Finder.get_selected().

Jose Over a year ago

Can you please explain how to pass the lines of that .txt file into junk?
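
One possible way to do that, as a minimal sketch rather than part of the answer itself: read the whole file into one string and run the same expression over it. The helper names values_from_text and values_from_file are made up for illustration, and converting the values to float is an assumption based on the numeric output shown in the question.

```python
import re
from pathlib import Path

# Same expression as in the answer above.
RX = re.compile(r'\b(?P<key>[A-Z][A-Z ]+)\b(?P<value>\d+(?:\.\d+)?)')

def values_from_text(text, names):
    """Return {name: value as float} for each requested name found in text."""
    db = {}
    for m in RX.finditer(text):
        # Collapse repeated spaces in the key, e.g. "RBC  COUNT" -> "RBC COUNT".
        key = ' '.join(m['key'].split())
        db[key] = float(m['value'])
    return {name: db[name] for name in names if name in db}

def values_from_file(path, names):
    """Read the whole file as a single string, then extract the values."""
    return values_from_text(Path(path).read_text(), names)
```

With the Finder class, the equivalent step would be junk = Path("report.txt").read_text(), where report.txt stands in for whatever your file is called.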

Kaushal28 · Accepted Answer · 2020-07-13 17:58:32Z

1

You can try this simple regex for that. The full match (group 0) spans the name and its value, and the value alone is in capture group 1: [A-Z]+\s+[A-Z]*\s+(\d+.\d*)

Explanation of the above regex:

It first matches one or more uppercase letters
Then matches one or more whitespace characters
Then matches zero or more uppercase letters (to cover space-separated keys in your text)
The last part captures the number: digits, one more character, then optional digits.

Here is the demo on regex101.com

Note: This regex can be easily improved to add more restrictions.
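
As a rough sketch of what "more restrictions" could look like (this is one possible tightening, not the expression from the answer): anchor the pattern at the start of each line, allow several uppercase words in the name, and escape the dot so that only integers or decimals are accepted. The names LINE_RX and first_two_columns are hypothetical.

```python
import re

# Name: uppercase words separated by spaces or tabs, at the start of a line.
# Value: an integer or decimal number that ends at a word boundary.
LINE_RX = re.compile(
    r'^[ \t]*([A-Z]+(?:[ \t]+[A-Z]+)*)[ \t]+(\d+(?:\.\d+)?)\b',
    re.MULTILINE,
)

def first_two_columns(text):
    """Return (name, value) pairs, with runs of spaces in the name collapsed."""
    return [(' '.join(name.split()), value) for name, value in LINE_RX.findall(text)]
```

Here the name is capture group 1 and the value is capture group 2. Lines whose name contains other characters, such as "COUNT (TLC)", are skipped by this version.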

edited Jul 13, 2020 at 17:58

Dharman♦

answered Jul 13, 2020 at 17:52

Kaushal28

2 Comments

Jan Over a year ago

Change the expression from ...(\d+.\d*) to (\d+(?:\.\d+)?), otherwise things like 123f, 1232323232?, 222!222, etc. are considered valid.

Kaushal28 Over a year ago

@Jan Yes, you are correct, those could match this regex, but I've only given a high-level hint (instead of spoon-feeding) and also noted that restrictions can be added to exclude such cases. That should be tried and done by the OP.

samman · Accepted Answer · 2020-07-14 07:09:01Z

-1

I'm sure there are better ways to do this. But this is what I've done in the past:

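import re

with open('file.txt') as file:
  for line in file:
    stripped = line.strip()
    # keep lines that start with a single word followed by a number
    search = re.search(r'^\w+\s+\d+', stripped)
    if search is not None:
      # search.string is the whole stripped line; its second field is the value
      extract = search.string.split()
      print(extract[1])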

You can replace \w+ in the search with the actual word you are looking for if you'd like. I've written this out in full, but with a list comprehension you could condense it to about two lines.
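
A condensed variant might look like the sketch below. It builds a dictionary instead of printing, which is an assumption based on the output format asked for in the question, and the function name single_word_values is hypothetical. Like the loop above, it only handles names that are a single word.

```python
import re

def single_word_values(lines):
    """Map each single-word name at the start of a line to the number after it."""
    hits = (re.search(r'^(\w+)\s+(\d+(?:\.\d+)?)', line.strip()) for line in lines)
    return {m.group(1): m.group(2) for m in hits if m is not None}
```

Because it accepts any iterable of lines, an open file object can be passed in directly.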

edited Jul 14, 2020 at 7:09

E_net4

answered Jul 13, 2020 at 17:51

samman
