
I'm having some trouble with a CSV data source that contains duplicate IDs. The final result, however, should contain each ID only once, so it was decided that we should process only the first instance we see and ignore any later ones.

Currently my code is a bit like this:

id_list = list()
for item in datasource:
    if item[0] not in id_list:   # linear scan of the list on every row
        # process item here
        id_list.append(item[0])

The problem is that as the list grows, performance drops. Are there more efficient ways of tracking the already-processed IDs?

3 Answers

5

Use a set object: sets are guaranteed not to contain duplicates and provide fast membership testing. You can use a set like this:

id_list = set()
for item in datasource: 
    if item[0] not in id_list:
        # process
        id_list.add(item[0])

This will perform better, since membership lookups on a set take constant time on average, as opposed to the linear-time lookup in a list.
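
For completeness, here is a minimal end-to-end sketch that dedupes while reading a CSV file with the standard csv module; the file name and the process step are placeholders of my own, not from the question:

    import csv

    seen_ids = set()
    with open("data.csv", newline="") as f:   # hypothetical file name
        for row in csv.reader(f):
            if row[0] not in seen_ids:        # constant-time membership test
                seen_ids.add(row[0])
                # process(row) -- only the first row for each ID reaches this point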


1

With reference to this question, I would suggest using a dict.

In particular, the fact that dict keys are unique seems appropriate here.

You could then try something like:

    if key not in seen:      # `seen` is a dict of IDs processed so far
        seen[key] = value    # record the ID along with its value
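
Adapted to the loop from the question, a minimal sketch of the dict approach could look like this (the variable names are my own):

    seen = {}
    for item in datasource:
        if item[0] not in seen:
            # process item here
            seen[item[0]] = item

Note that on Python 3.7 and later a plain dict also preserves insertion order, which is relevant to the ordering concern raised in the comments below.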

2 Comments

But the problem with this approach is that we cannot maintain the order as it is. We may have to use an OrderedDict.
I wasn't aware of the need for ordered results, since you only mentioned duplicate results...
0

Instead of using a list, you can use a binary search tree ordered by the ID.
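
To illustrate, here is a minimal sketch of that idea (the class and function names are my own, not from the answer); on roughly balanced input, lookups and inserts take O(log n) on average, although an unbalanced tree can degrade to O(n):

    class Node:
        """A single node of the search tree."""
        def __init__(self, key):
            self.key = key
            self.left = None
            self.right = None

    def insert(root, key):
        """Insert key and return the (possibly new) root; duplicates are ignored."""
        if root is None:
            return Node(key)
        if key < root.key:
            root.left = insert(root.left, key)
        elif key > root.key:
            root.right = insert(root.right, key)
        return root

    def contains(root, key):
        """Return True if key is already in the tree."""
        while root is not None:
            if key == root.key:
                return True
            root = root.left if key < root.key else root.right
        return False

    # Usage in the original loop:
    root = None
    for item in datasource:
        if not contains(root, item[0]):
            # process item here
            root = insert(root, item[0])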
