
I wrote a function that converts the unicode strings in my input data to UTF-8 encoded byte strings.

The function can take a raw string, a dict, or a list as input and returns the corresponding UTF-8 encoded output.

This function is part of a bigger project that I am working on, and it gives the expected output.

The problem is that it is becoming a bottleneck in terms of execution time. A single call takes close to 1 ms, but because the project calls this function repeatedly, the total cost is hurting the response time of my API.

def fix_unicode(self, data):
    if isinstance(data, unicode):
        return data.encode('utf-8')
    elif isinstance(data, dict):
        # recurse through self, since fix_unicode is a method
        data = dict((self.fix_unicode(k), self.fix_unicode(data[k])) for k in data)
    elif isinstance(data, list):
        for i in xrange(0, len(data)):
            data[i] = self.fix_unicode(data[i])
    return data

Can I optimise this function further? If yes, how?

  • use python3!!!! Commented Mar 13, 2018 at 9:42
  • how can it improve time ? Commented Mar 13, 2018 at 9:44
  • you mean speed? Commented Mar 13, 2018 at 9:46
  • @Rahul sorry, yes Commented Mar 13, 2018 at 9:49
  • If your code works without errors then a better place to ask might be Code Review. Commented Mar 13, 2018 at 10:32

2 Answers

You can improve execution speed by making a few changes:

  1. Check the type of data only once rather than up to three times, for example with data_type = type(data).
  2. Building the dict in a single expression is a good idea. You can speed this up by using a dict comprehension directly instead of (a) setting up a generator expression and then (b) passing it to the dict function.
  3. Avoid recursion where possible in Python. Python has no form of tail call optimisation and enforces a recursion limit, so the recursive fix_unicode calls inside the dict(...) expression can fail with a "maximum recursion depth exceeded" error on deeply nested data.
  4. You can avoid iterating over the list manually by using the higher-order function map.

To achieve the above, we can break the function into 2 parts for modularity and efficiency:

def unicode_to_utf(self, string):
    """(unicode string) -> utf8_string"""
    return string.encode("utf-8")


def fix_unicode(self, data):
    # Handles one level only: keys, values and list items must all be unicode strings.
    data_type = type(data)
    assert data_type in (unicode, dict, list),\
            "data must be either a unicode string, list or dictionary"
    fix = self.unicode_to_utf  # bind the method to a local name for faster lookup
    if data_type is unicode:
        return fix(data)
    elif data_type is dict:
        return {fix(k): fix(v) for k, v in data.iteritems()}
    else:
        return map(fix, data)

If you would rather modify the list in place, as your original function does, you can replace return map(fix, data) with the slice assignment data[:] = [fix(datum) for datum in data] followed by return data. The function's behaviour will then be inconsistent, however, because it returns new objects for strings and dicts (although you could mutate the dict in place as well) while it modifies lists in place. That's a trade-off for you to make.
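The difference between returning a new list and mutating the caller's list can be sketched as follows. This is only an illustration under a few assumptions: fix, fix_list_copy, fix_list_in_place and text_type are stand-in names that do not appear in the question, and type(u"") is used in place of unicode so that the snippet runs on both Python 2 and Python 3.

```python
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3 (assumed alias)


def fix(value):
    """Encode a text string as UTF-8 bytes; leave anything else untouched."""
    return value.encode("utf-8") if isinstance(value, text_type) else value


def fix_list_copy(items):
    """Return a new list; the caller's list is left unchanged."""
    return [fix(item) for item in items]


def fix_list_in_place(items):
    """Overwrite the contents of the caller's list via slice assignment."""
    items[:] = [fix(item) for item in items]
    return items


original = [u"caf\xe9", u"abc"]

copied = fix_list_copy(original)
# The copy is a different object and the original still holds text.
assert copied is not original and original[0] == u"caf\xe9"

same = fix_list_in_place(original)
# The same list object now holds the encoded bytes.
assert same is original and original[0] == b"caf\xc3\xa9"
```

A plain list comprehension on its own only builds a new list; it is the slice assignment on the left-hand side that writes the result back into the existing object.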

In the meantime, your code should run faster.
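Note that the version above converts a single level only. If your data can be nested, one way to keep the recursion out (point 3) is an explicit stack. The following is a rough sketch rather than a drop-in replacement: fix_unicode_iterative and text_type are hypothetical names, the function mutates dicts and lists in place, and it assumes the structure contains no reference cycles.

```python
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3 (assumed alias)


def fix_unicode_iterative(data):
    """Encode all text in nested dicts/lists as UTF-8, without recursion.

    Containers are modified in place. Assumes no reference cycles.
    """
    if isinstance(data, text_type):
        return data.encode("utf-8")

    stack = [data]  # containers that still need to be processed
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            # Keys may change, so snapshot the items and rebuild the dict in place.
            items = list(current.items())
            current.clear()
            for key, value in items:
                if isinstance(key, text_type):
                    key = key.encode("utf-8")
                if isinstance(value, text_type):
                    value = value.encode("utf-8")
                elif isinstance(value, (dict, list)):
                    stack.append(value)  # processed later, in place
                current[key] = value
        elif isinstance(current, list):
            for index, value in enumerate(current):
                if isinstance(value, text_type):
                    current[index] = value.encode("utf-8")
                elif isinstance(value, (dict, list)):
                    stack.append(value)
    return data


nested = {u"name": [u"a", {u"b": u"c"}], u"count": 1}
fix_unicode_iterative(nested)
assert nested == {b"name": [b"a", {b"b": b"c"}], b"count": 1}
```

Because the pending containers live in an ordinary list, the nesting depth is limited by memory rather than by the interpreter's recursion limit. Whether this is actually faster than the recursive version for your payloads is something to measure.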

Though @xero-smith's answer is excellent, you should ask yourself a question: are you sure you don't know the type of the data before you call the function? Usually, this kind of "overloaded" method is used at a higher level, not in a method that may become a bottleneck.

I can imagine two cases:

  1. You are the producer of the data. Then you should know its type, and a catch-all method is just a bad idea. In a compiled language, the choice of overloaded method is made at compile time, so there is no penalty for having three fix_unicode methods, one for strings, one for dicts and one for lists. Python has no such mechanism, so just define three methods and pick the right one yourself.
  2. You are only the consumer of the data. Then you should try to find out the type of the data. Where does this data come from? A JSON POST? A text file? Can't you convert it before building a dict or a list? You mentioned an API: why not a parameter in the query string? Naturally, you should keep the current API, but you could add an optional "hint" parameter that will speed up your project. (This has to be benchmarked.) Try everything you can to avoid checking the type of the data at runtime.
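As a rough sketch of the first case, the three dedicated functions could look like this. The names fix_text, fix_text_list and fix_text_dict are made up for illustration, and the sketch assumes flat containers that hold only text strings; it is written to run on both Python 2.7 and Python 3.

```python
import timeit


def fix_text(text):
    """Encode a single text string as UTF-8 bytes."""
    return text.encode("utf-8")


def fix_text_list(items):
    """Encode every item of a flat list of text strings."""
    return [item.encode("utf-8") for item in items]


def fix_text_dict(mapping):
    """Encode the keys and values of a flat dict of text strings."""
    return {key.encode("utf-8"): value.encode("utf-8")
            for key, value in mapping.items()}


if __name__ == "__main__":
    # Hypothetical payload; replace it with data shaped like your real input.
    payload = {u"key%d" % i: u"value%d" % i for i in range(100)}
    seconds = timeit.timeit(lambda: fix_text_dict(payload), number=1000)
    print("fix_text_dict: %.6f s per 1000 calls" % seconds)
```

The caller picks the function that matches the data it already knows it has, so no type check runs per call. Timing it against the generic function with timeit, as in the last lines, is the only way to know whether the gain matters for your workload.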
