Alternative to Regex for large string format/replace

Question

I have a very large string of key-value pairs (old_string) that is formatted like so:

"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."

This string is very large, since it can cover up to 30k customers. I am using it to write a file to upload to an online segmentation tool, which requires this format with one modification: the primary key (visitorid) needs to be tab-separated and not in quotes. The end result needs to look like this (note that the 4 spaces are a tab):

gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks    "customer_name"="larry", "customer_state"="alabama",...ABC3k9sk-gj49-92ks-dgjs-j2ks-j29slgj9bbbb

I wrote the following function, which does this fine, but I've noticed that this portion of the script runs the slowest (I am assuming because regex is generally slow).

import re

def getGUIDS(old_string):
    '''
    Finds guids in the string and formats it as PK for syncfile
    @param old_string the string created from the nested dict
    @return old_string_fmt the formatted version
    '''

    print ('getting ids')
    ids = re.findall(r'("\w{8}-\w{4}-\w{4}-\w{4}-\w{12}",)', old_string) #looks for GUID based on regex

    old_string_fmt = old_string  # start from the input so every replacement accumulates
    for element in ids:
        new_str = element.strip('",') + '\t'  # drop the quotes and trailing comma, append a tab
        old_string_fmt = old_string_fmt.replace(element, new_str)  # rescans the whole string per match

    return old_string_fmt

Is there a way this can be done without regex that might speed this up?

github.com/scripal-git/scripal It was removed and SO is not for software recommendation.
– Oleksii Kyslytsyn, Commented Aug 8 at 7:43

Wiktor Stribiżew · Accepted Answer · 2018-02-22 21:40:59Z

The approach is wrong: you match all occurrences meeting your regex and then, for each match, run a separate replace over the whole string. You may simply use re.sub to find all non-overlapping matches and replace them with what you need in a single pass.

See this Python demo:

import re

def getGUIDS(old_string):
    '''
    Finds guids in the string and formats it as PK for syncfile
    @param old_string the string created from the nested dict
    @return old_string_fmt the formatted version
    '''
    print ('getting ids')
    return re.sub(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,|$)', '\\1\t', old_string) #looks for GUID based on regex

s='"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."'
print(getGUIDS(s))
# => getting ids
# => gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks   "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."

I added "\w+"= at the start of the regex to also match the key of the GUID value so that it gets removed, replaced the , at the end with (?:,|$) to match either a , or the end of the string (to also handle cases where the key-value pair is the last one in the string), and enclosed the part you need to keep in capturing parentheses.

The replacement is a backreference to the capturing group #1 and a tab char.
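As a minimal sketch of how the capturing group, the backreference, and the (?:,|$) alternative behave together, the snippet below runs the same pattern over a short two-record string. The names GUID_PAIR, format_guids, and sample are hypothetical, the second GUID is made up for illustration, and precompiling the pattern is an optional choice rather than part of the demo above.

```python
import re

# Compile once so the pattern object can be reused across calls.
# Group 1 captures the GUID; the key, the quotes, and the trailing comma
# (or end of string) are matched but not kept.
GUID_PAIR = re.compile(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,|$)')

def format_guids(text):
    # '\\1' is the backreference to group 1; '\t' is a literal tab appended after it.
    return GUID_PAIR.sub('\\1\t', text)

# Illustrative input: one GUID followed by a comma, one at the very end of the string.
sample = (
    '"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", '
    '"visitorid"="abc3k9sk-gj49-92ks-dgjs-j2ks-j29slgj9bbbb"'
)
result = format_guids(sample)
print(repr(result))
```

Both GUID pairs are rewritten in one scan of the string. The "customer_name" pair is left untouched because its value does not have the GUID shape, and the space that followed the consumed comma remains after the tab.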
