1

I want to extract relevant information from the text column of a DataFrame using regular expressions.

If I have this DataFrame:

+-------------------------------------------+
|text                                       |
+-------------------------------------------+
|Hello this is a test +34666666666 677777777|
|Hello this a test 44442222M 33335555C      |
+-------------------------------------------+

I would like to have the following output:

+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                       |match                                                                                                                                               |
+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|Hello this is a test +34666666666 677777777|[(PHONE,u'34666666666',u'677777777')]|
|Hello this a test 44442222M 33335555C      |[(DNI, u'44442222M',u'33335555C')]                                                                                                      |
+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

I've try using this:

pat_list = [['PHONE','\\b((\+?34([ \t|\-])?)?[9|6|7]((\d{1}([ \t|\-])?[0-9]{3})|(\d{2}([ \t|\-])?'
        '[0-9]{2}))([ \t|\-])?[0-9]{2}([ \t|\-])?[0-9]{2})\\b'],
['DNI','\\b(\d{8})([A-Z])\\b']]

l1 = ['Hello this is a test +34666666666 677777777','Hello this a test 44442222M 33335555C']
l = zip(l1)
df = spark.createDataFrame(l,['text'])
rdd = df.rdd.map(list)

def parse_pat(row, col_number, patterns):
    column = row[col_number]
    hit_words = []
    for patron in patterns:
        patron_comp = re.compile(patron[1], re.IGNORECASE)
        match = patron_comp.search(column)
        if match:
            hit_words.append(patron[0])
            find= re.findall(patron[1], column, re.DOTALL)
            hit_words.append(' '.join(str(x) for x in find))
    myrow = list(row)
    myrow.append(hit_words)
    return myrow

rdd_parse_pat = rdd.map(lambda row: (parse_pat(row, col_number=0, patterns=pat_list)))
df_out = rdd_parse_pat.toDF(['text','match'])

But I get this:

df_out.show(truncate=False)

+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|text                                       |match                                                                                                                                               |
+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|Hello this is a test +34666666666 677777777|[PHONE, (u'34666666666', u'34', u'', u'6666', u'6666', u'', u'', u'', u'', u'') (u'677777777', u'', u'', u'7777', u'7777', u'', u'', u'', u'', u'')]|
|Hello this a test 44442222M 33335555C      |[DNI, (u'44442222', u'M') (u'33335555', u'C')]                                                                                                      |
+-------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------+

Does anyone know how could I do it?

4
  • Just to clarify, what are you trying to match with [ \t|\-] and [9|6|7]? Because I don't think you're doing what you think you're doing. Also, {1} is redundant. Commented Feb 28, 2018 at 15:43
  • because of \b, \+? can't match because + is a non word character use (?<=\W) instead, seems you need non capturing groups (?: .. ) instead of groups ( .. ) Commented Feb 28, 2018 at 16:06
  • also all backslashes must be doubled for regex between single quotes, see if regex101 link does what you want Commented Feb 28, 2018 at 16:14
  • @NahuelFouilleul I tried using your regex but doen't match any Phone. Thank you! Commented Mar 2, 2018 at 8:36

1 Answer 1

1

You are joining all possible groups for each match... try with this:

def parse_pat(row, col_number, patterns):
    column = row[col_number]
    hit_words = {}
    for patron in patterns:
        patron_comp = re.compile(patron[1], re.IGNORECASE|re.DOTALL)
        start_pos = 0
        while True:
            match = patron_comp.search(column, start_pos)
            if match:
                key = patron[0]  #  Match name 
                value = match.group(0)  # Match value
                if key not in hit_words:
                    hit_words[key] = [value]  #  First match of its name
                else:
                    hit_words[key] = hit_words[key] + [value]   # Append to previous match
                start_pos = match.end(0)  # To look for other possible matches
            else:
                break  # There is no more matches.
    myrow = list(row)
    myrow.append(hit_words)
    return myrow

It creates a dict for each possible pattern.
- Key: pattern's name (PHONE, DNI ...)
- Value: list with all matches.

Result:

+-------------------------------------------+--------------------------------------------------+
|text                                       |match                                             |
+-------------------------------------------+--------------------------------------------------+
|Hello this is a test +34666666666 677777777|Map(PHONE -> WrappedArray(34666666666, 677777777))|
|Hello this a test 44442222M 33335555C      |Map(DNI -> WrappedArray(44442222M, 33335555C))    |
+-------------------------------------------+--------------------------------------------------+
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.