I have a very large string of key-value pairs (old_string) that is formatted like so:

"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."

This string is very large, since it can cover up to 30k customers. I am using it to write a file to upload to an online segmentation tool, which requires this format with one modification: the primary key (visitorid) needs to be tab-separated and not in quotes. The end result needs to look like this (note that the 4 spaces are a tab):

gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks    "customer_name"="larry", "customer_state"="alabama",...ABC3k9sk-gj49-92ks-dgjs-j2ks-j29slgj9bbbb

I wrote the following function, which does this fine, but I've noticed that this portion of the script runs the slowest (I am assuming because regex is generally slow).

def getGUIDS(old_string):
    '''
    Finds guids in the string and formats it as PK for syncfile
    @param old_string the string created from the nested dict
    @return old_string_fmt the formatted version
    '''

    print ('getting ids')
    ids = re.findall('("\w{8}-\w{4}-\w{4}-\w{4}-\w{12}",)', cat_string) #looks for GUID based on regex

    for element in ids:
      new_str = str(element.strip('"').strip('"').strip(",").strip('"') + ('\t'))
      old_string_fmt = old_string.replace(element, new_str)


    return old_string_fmt

Is there a way this can be done without regex that might speed this up?

1 Answer

The approach is wrong: you first collect all occurrences matching your regex with re.findall and then call str.replace once per match, which rescans the whole string for every GUID. You can simply use re.sub, which finds all non-overlapping matches and replaces them with what you need in a single call.

See this Python demo:

import re

def getGUIDS(old_string):
    '''
    Finds guids in the string and formats it as PK for syncfile
    @param old_string the string created from the nested dict
    @return old_string_fmt the formatted version
    '''
    print ('getting ids')
    return re.sub(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,|$)', '\\1\t', old_string) #looks for GUID based on regex

s='"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."'
print(getGUIDS(s))
# => getting ids
# => gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks   "customer_name"="larry", "customer_state"="alabama",..."visitorid"="..."

I added "\w+"= at the start of the regex so that the key of the GUID value is matched and removed as well, replaced the , at the end with (?:,|$) to match either a comma or the end of the string (so the pair is also handled when it is the last one in the string), and enclosed the part you need to keep in capturing parentheses.

The replacement is a backreference to capturing group 1 followed by a tab character.
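As a possible refinement, here is a minimal sketch rather than a definitive implementation. It assumes the space left between the tab and the next key is unwanted, so it adds \s* after the comma to consume it. It also compiles the pattern once; re already caches compiled patterns, so any gain from that is likely small. The names GUID_PAIR and format_guids are made up for this illustration.

```python
import re

# Hypothetical module-level name: compiled once and reused on every call.
# Same pattern as above, except that ",\s*" also consumes whitespace after the comma.
GUID_PAIR = re.compile(r'"\w+"="(\w{8}(?:-\w{4}){4}-\w{12})"(?:,\s*|$)')

def format_guids(old_string):
    """Replace each quoted GUID key-value pair with the bare GUID plus a tab."""
    # A function replacement avoids any escaping questions in the template string.
    return GUID_PAIR.sub(lambda m: m.group(1) + '\t', old_string)

sample = ('"visitorid"="gh43k9sk-gj49-92ks-jgjs-j2ks-j29slgj952ks", '
          '"customer_name"="larry"')
result = format_guids(sample)
print(repr(result))
```

Whether the tool tolerates a space after the tab is not stated in the question, so drop the \s* if the original spacing has to be preserved.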
