Assign strings to IDs in Python

Question

I am reading a text file with Python in which the values in each column may be either numeric or strings.

When those values are strings, I need to assign a unique ID of that string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column).

What would be an efficient way to do it?

user2357112 · Accepted Answer · 2013-09-04 08:02:28Z

12

Use a defaultdict with a default value factory that generates new ids:

import collections
import itertools

# Python 2 spelling; Python 3 renamed .next to .__next__
ids = collections.defaultdict(itertools.count().next)
ids['a']  # 0
ids['b']  # 1
ids['a']  # 0

When you look up a key in a defaultdict, if it's not already present, the defaultdict calls a user-provided default value factory to get the value and stores it before returning it.
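To make that lookup behaviour concrete, here is a minimal sketch (Python 3). The `make_default` factory and the `calls` list are made up for illustration; they only exist to show that the factory runs the first time a key is missing and not on later lookups.

```python
import collections

calls = []  # records each time the factory runs (illustrative only)

def make_default():
    """Hypothetical factory: returns how many times it was called before."""
    calls.append(None)
    return len(calls) - 1

d = collections.defaultdict(make_default)

first = d['a']    # key missing: factory runs, result is stored under 'a'
second = d['a']   # key present: stored value is returned, factory not called
third = d['b']    # new key: factory runs again
```

After this runs, `calls` has two entries, one per distinct key.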

itertools.count() creates an iterator that counts up from 0, so itertools.count().next is a bound method that produces a new integer whenever you call it.

Combined, these tools produce a dict that returns a new integer whenever you look up something you've never looked up before.
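Applied to the question's per-column requirement, one possible sketch (Python 3, so `.__next__` instead of `.next`) keeps a separate ID table for each column. The sample `rows`, the `is_number` helper and the choice of `float()` as the test for "numeric" are assumptions for illustration, not part of the original answer.

```python
import collections
import itertools

def new_id_table():
    """One string -> ID table; IDs start at 0 in each table."""
    return collections.defaultdict(itertools.count().__next__)

def is_number(value):
    """Assumption: anything float() accepts counts as numeric."""
    try:
        float(value)
    except ValueError:
        return False
    return True

# Hypothetical rows, as they might come out of csv.reader.
rows = [
    ['1', 'red', 'x'],
    ['2', 'blue', 'x'],
    ['3', 'red', '4.5'],
]

column_ids = collections.defaultdict(new_id_table)  # column index -> ID table

# Leave numeric values alone; replace each string with its per-column ID.
encoded = [
    [value if is_number(value) else column_ids[col][value]
     for col, value in enumerate(row)]
    for row in rows
]
```

Here 'red' gets the same ID both times it appears in the second column, and the third column numbers its strings independently of the second.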

edited Sep 4, 2013 at 8:02

answered Sep 4, 2013 at 4:41


13 Comments

Burhan Khalid Over a year ago

This is not what he asked?

user2357112 Over a year ago

@BurhanKhalid: How so?

g.d.d.c Over a year ago

This should do exactly what he needs. If he's iterating through rows of data, each unique value gets a unique integer by simple insertion. Duplicate checking is built in.

Burhan Khalid Over a year ago

He asked to only assign values to strings not each column: "When those values are strings, I need to assign a unique ID of that string (unique across all the strings under the same column; the same ID must be assigned if the same string appears elsewhere under the same column)." He can get a running counter by just enumerating over the file.

user2357112 Over a year ago

@BurhanKhalid: So he looks up the string in the defaultdict and gets its ID. What's the problem? EDIT: Are you looking at an old version of the answer? The first version (gone now) just had the counter, without the defaultdict, since I didn't see the requirement of assigning the same ID to a string if it shows up twice.


Greg Allen · Answer · 2013-10-09 10:06:38Z

2

The defaultdict answer, updated for Python 3, where .next is now .__next__, and for pylint compliance, where calling "magic" __*__ methods directly is discouraged:

ids = collections.defaultdict(functools.partial(next, itertools.count()))
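A quick usage check of this form under Python 3, with the imports it needs. The `frozen` step at the end is an optional extra of mine, not part of the answer: copying into a plain dict once loading is done means a later lookup of an unseen string raises KeyError instead of quietly minting a new ID.

```python
import collections
import functools
import itertools

ids = collections.defaultdict(functools.partial(next, itertools.count()))

# Repeated strings get the ID they were given the first time.
labels = [ids[s] for s in ['a', 'b', 'a', 'c']]

# Optional: freeze the table so unknown strings are an error from here on.
frozen = dict(ids)
```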

answered Oct 9, 2013 at 10:06


Burhan Khalid · Answer · 2013-09-04 05:07:50Z

Create a set, and then add strings to the set. This will ensure that strings are not duplicated; then you can use enumerate to get a unique id of each string. Use this ID when you are writing the file out again.

Here I am assuming the second column is the one you want to scan for text or integers.

import csv

seen = set()
with open('somefile.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        try:
            int(row[1])
        except ValueError:
            seen.add(row[1])  # not an integer, so record the string

# print the unique ids for each string

for string_id, text in enumerate(seen):
    print("{}: {}".format(string_id, text))
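One caveat with enumerating a set directly: the numbering follows the set's iteration order, which for strings is not guaranteed to be the same from one run of the program to the next. If the IDs need to be repeatable, a possible tweak is to sort first and keep the result as a lookup table. The literal `seen` set below is a stand-in for the one built while reading the file, and `string_to_id` is a name chosen for this sketch.

```python
seen = {'pear', 'apple', 'fig'}  # stand-in for the set built from the file

# Sort first so the numbering does not depend on set iteration order.
string_to_id = {text: i for i, text in enumerate(sorted(seen))}
```

The dict can then be used to replace each string with its ID when writing the file back out.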

Now you can take the same logic and replicate it across each column of your file. If you know the number of columns in advance, you can use a list of sets. Suppose the file has three columns:

import csv

unique_strings = [set(), set(), set()]

with open('file.txt') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        for column, value in enumerate(row):
            try:
                int(value)
            except ValueError:
                # It is not an integer, so it must be
                # a string
                unique_strings[column].add(value)
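If the number of columns is not known up front, one variation is to key the sets by column index with a defaultdict instead of a fixed-length list. This is only a sketch: the in-memory `data` stands in for the real file, and the `column_ids` lookup tables at the end are an addition to show how the sets could be turned into IDs.

```python
import collections
import csv
import io

# In-memory stand-in for the file (hypothetical data).
data = io.StringIO("1,red,x\n2,blue,x\n3,red,7\n")

unique_strings = collections.defaultdict(set)  # column index -> set of strings

for row in csv.reader(data):
    for column, value in enumerate(row):
        try:
            int(value)
        except ValueError:
            unique_strings[column].add(value)

# One lookup table per column; sorted() keeps the numbering repeatable.
column_ids = {
    column: {text: i for i, text in enumerate(sorted(strings))}
    for column, strings in unique_strings.items()
}
```

Columns that contain only integers never get an entry, so `column_ids` holds tables just for the columns where strings were found.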
