4

I am reading a text file with Python in which the values in each column may be either numeric or strings.

When those values are strings, I need to assign a unique ID to each string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).

What would be an efficient way to do it?

3 Answers

12

Use a defaultdict with a default value factory that generates new ids:

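import collections
import itertools

# Python 2 only: count().next is the iterator's bound "next" method
ids = collections.defaultdict(itertools.count().next)
ids['a']  # 0
ids['b']  # 1
ids['a']  # 0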

When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.
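To make the factory behavior concrete, here is a minimal sketch in Python 3 syntax. The names `make_id` and `calls` are invented for illustration and are not part of the answer; the point is only that the factory runs on the first lookup of a key and not on later ones.

```python
import collections

calls = []  # hypothetical log of factory invocations, for illustration only


def make_id():
    """Default factory: record the call and return the next unused integer."""
    calls.append(len(calls))
    return calls[-1]


ids = collections.defaultdict(make_id)

first = ids['a']   # 'a' is missing, so make_id() runs and its result is stored
second = ids['a']  # 'a' is now present, so the factory is not called again
other = ids['b']   # a new key triggers the factory once more
```

After these three lookups the factory has been called twice, once per distinct key.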

itertools.count() creates an iterator that counts up from 0, so itertools.count().next (in Python 2) is a bound method that produces a new integer whenever you call it.

Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
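Applied to the question, one such dict per column keeps the IDs unique within each column. The following is a rough sketch rather than a definitive implementation: it assumes Python 3, assumes that "numeric" means "parses as a float", and uses made-up sample rows and helper names (`is_number`, `new_id_table`, `encode_rows`).

```python
import collections
import functools
import itertools


def is_number(value):
    """Return True if the text parses as a float (an assumed definition of numeric)."""
    try:
        float(value)
    except ValueError:
        return False
    return True


def new_id_table():
    """Build a dict that hands out 0, 1, 2, ... to keys it has not seen before."""
    return collections.defaultdict(functools.partial(next, itertools.count()))


def encode_rows(rows):
    """Replace each string cell with an ID that is unique within its column."""
    # One ID table per column index, created lazily on first use.
    column_ids = collections.defaultdict(new_id_table)
    encoded = []
    for row in rows:
        encoded.append([
            value if is_number(value) else column_ids[col][value]
            for col, value in enumerate(row)
        ])
    return encoded, column_ids


# Sample rows standing in for parsed lines of the text file.
rows = [['1', 'red', 'x'], ['2', 'blue', 'x'], ['3', 'red', 'y']]
encoded, column_ids = encode_rows(rows)
```

Here 'red' gets the same ID both times it appears in the second column, while the third column numbers its strings independently, starting again from 0.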


13 Comments

This is not what he asked?
@BurhanKhalid: How so?
This should do exactly what he needs. If he's iterating through rows of data, each unique value gets a unique integer by simple insertion. Duplicate checking is built in.
He asked to only assign values to strings not each column: "When those values are strings, I need to assign a unique ID of that string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column)." He can get a running counter by just enumerating over the file.
@BurhanKhalid: So he looks up the string in the defaultdict and gets its ID. What's the problem? EDIT: Are you looking at an old version of the answer? The first version (gone now) just had the counter, without the defaultdict, since I didn't see the requirement of assigning the same ID to a string if it shows up twice.
2

The defaultdict answer, updated for Python 3, where .next is now .__next__, and for pylint compliance, where calling "magic" __*__ methods directly is discouraged:

ids = collections.defaultdict(functools.partial(next, itertools.count()))

0

Create a set, and then add strings to the set. This will ensure that strings are not duplicated; then you can use enumerate to get a unique id of each string. Use this ID when you are writing the file out again.

Here I am assuming the second column is the one you want to scan for text or integers.

import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # not an integer, so add the string to the set

# print the unique ids for each string

for id,text in enumerate(seen):
    print("{}: {}".format(id, text))

Now you can take the same logic and replicate it across each column of your file. If you know the number of columns in advance, you can keep a list of sets, one per column. Suppose the file has three columns:

import csv

unique_strings = [set(), set(), set()]

with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # It is not an integer, so treat it as a string
                unique_strings[column].add(value)
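Once the sets are filled, they still have to be turned into per-column ID lookups. One possible way is sketched below with hard-coded sample sets in place of a real file. Sorting before numbering is an addition of this sketch, not part of the answer: sets are unordered, so enumerating them directly can give different IDs from one run to the next.

```python
# Sample data standing in for the sets collected from the file.
unique_strings = [{'red', 'blue'}, {'x', 'y', 'z'}, set()]

# Sort each set so the assigned IDs are reproducible, then number the strings.
string_ids = [
    {text: index for index, text in enumerate(sorted(column))}
    for column in unique_strings
]

# Look up the ID of a string in a given column (column 0 here).
red_id = string_ids[0]['red']
```

Each entry of `string_ids` is a plain dict for one column, so the same string always maps to the same ID within that column.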
