Humbly asking again for the community's help.

My data analysis task is to study the relationships between columns of a given dataset, so I first have to clean up the columns I want to work with. The column I need holds data that looks like a list of dictionaries but is actually a string, so I have to extract the 'name' values from those would-be "dictionaries".

The code below is my attempt (magical rituals, really) to pull the "name" values out of that string and save them in another column as a string containing a list of just those values. I would then apply the function to the whole column and group by the unique combinations of those strings. (The more ambitious goal was to split the "name" values into several additional columns and sort by all of them later, but a single string in the source column (df['specializations']) can contain any number of "dictionaries", so I can't know in advance how many columns to create. I gave up on that idea.)

A typical string with a pseudo-list of dictionaries looks like this (the number of "dictionaries" varies):

[{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}]

import re

def get_names_values(df):
    for a in df['specializations']:
        for r in (("\'", "\""), ("[", ""), ("]", ""), ("}", "")):
            a = a.replace(*r)
        a = re.split("{", a)
        m = 0
        while m < len(a):
            if a[m] in ('', ': ', ', '):
                del a[m]
            m += 1
        a = "".join(a)
        a = re.split("\"", a)
        n = 0
        while n < len(a):
            if a[n] in ('', ': ', ', '):
                del a[n]
            n += 1
        nameslist = []
        for num in range(len(a)):
            if a[num] == 'name':
                nameslist.append(a[num+1])
        return str(nameslist)


df['specializations_names'] = df['specializations'].fillna('{}').apply(get_names_values)
df['specializations_names']

The problem arises at for a in df['specializations']:, which raises TypeError: string indices must be integers. I checked that loop separately (with print(a)) and it gave the expected result. I also tried:

for k in range(len(df)):
  a = df['specializations'][k]

and again it worked on its own, but inside my function it raises the TypeError. I feel like giving up on the ['specializations'] column and trying some others, but I'm still curious what's wrong here and how to solve it.

Huge thanks in advance to everyone who tries to help.

1 Answer

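What you describe as a "string with a pseudo-list of dictionaries" looks like JSON-style data; strictly speaking, the single quotes make it the string representation of a Python list of dicts rather than valid JSON. You can use eval() to convert it to an actual list of dicts and then work with it normally. Use eval() with caution, though, since it executes whatever code the string contains. I tried to recreate that string and make it work: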

str_dicts = str([{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}])

dicts = list(eval(str_dicts))
     
names = [d['name'] for d in dicts]

print(names)

[0]: ['Beginner', 'Testing', 'IT']

If your column is a Series of strings that each represent a list of dicts, a list comprehension like this does the job:

df['specializations_names'] = [[d['name'] for d in list(eval(row))] 
                               for row in df['specializations']]
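
If the strings might come from an untrusted source, the same extraction can be written with ast.literal_eval, which only parses Python literals and does not execute code. This is a minimal sketch, not a drop-in replacement: the helper name parse_names is made up for illustration, and it assumes that missing cells (NaN or None) should simply become empty lists.

```python
import ast


def parse_names(cell):
    """Return the 'name' values from one cell (hypothetical helper)."""
    # Assumption: anything that is not a string (NaN, None) means "no data".
    if not isinstance(cell, str):
        return []
    # literal_eval accepts only Python literals, so it cannot run code.
    return [d['name'] for d in ast.literal_eval(cell)]


row = ("[{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, "
       "{'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}]")

print(parse_names(row))           # ['Beginner', 'Testing']
print(parse_names(float('nan')))  # []
```

With a DataFrame this would be used as df['specializations_names'] = df['specializations'].apply(parse_names), so no fillna() is needed beforehand.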

Here is a partial reproduction of your setup, based on what you provided:

import pandas as pd

str_dicts = str([{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'},
                 {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}])

df = pd.DataFrame({'specializations': [str_dicts, str_dicts, str_dicts]})

df['specializations_names'] = [[d['name'] for d in list(eval(row))] 
                               for row in df['specializations']]

print(df)

Which resulted in:

specializations specializations_names
0 [{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}] ['Beginner', 'Testing', 'IT']
1 [{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}] ['Beginner', 'Testing', 'IT']
2 [{'id': '1.172', 'name': 'Beginner', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '1.117', 'name': 'Testing', 'profarea_id': '1', 'profarea_name': 'IT'}, {'id': '15.93', 'name': 'IT', 'profarea_id': '15', 'profarea_name': 'Beginner'}] ['Beginner', 'Testing', 'IT']

The same approach works when each row holds a list with a different number of dicts; the identical dummy rows above are only placeholders, and there can be as many rows as df has.
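
As for the TypeError in the question: Series.apply calls the function once per cell, so inside get_names_values the parameter named df is really a single string, and df['specializations'] ends up indexing a string with a string. The sketch below reproduces that failure and shows a per-cell rewrite. It assumes the cell format shown in the question, and names_from_cell is a hypothetical name, not something from the original code.

```python
import ast


def names_from_cell(cell):
    """Hypothetical per-cell version: receives ONE string, not the DataFrame."""
    parsed = ast.literal_eval(cell)  # '{}' (from fillna) parses to an empty dict
    # Returns a string, as the original function did.
    return str([d['name'] for d in parsed])


cell = "[{'id': '1.172', 'name': 'Beginner'}, {'id': '1.117', 'name': 'Testing'}]"

# What the original function effectively did with each cell:
try:
    cell['specializations']
except TypeError as exc:
    print('TypeError:', exc)

print(names_from_cell(cell))  # ['Beginner', 'Testing']
print(names_from_cell('{}'))  # []
```

It plugs into the original call unchanged: df['specializations'].fillna('{}').apply(names_from_cell).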

4 Comments

If it's JSON data, the preferred way to turn it into a Python object is to use the json.loads function. If it's a string representation of a Python dict, you should use ast.literal_eval, not eval, because the latter can and will execute arbitrary, potentially harmful code contained in the string.
Thanks a lot, guys (well, I mean, a huge, huge thank-you :D). Before all the machinations above I had tried df['specializations_json'] = df['specializations'].fillna('{}').apply(eval), but it seemed to have no effect: the strings remained unchanged. So I gave up on the JSON idea at the time. Happy to see it can be solved so elegantly.
@fsimonjetz thanks for your addition! I respect your security concern, though I have to note that ast.literal_eval took 5 seconds for the whole action to complete, while simple eval took 3 secs.
Actually, our tutor told us that DataFrames (and pandas as a whole) don't really like holding lists and weren't designed for that. But okay, if it works, it works, never mind))
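
On lists inside DataFrames and the grouping mentioned in the question: one option is to give each name its own row instead of its own column, which sidesteps not knowing how many names a row will have. This is a rough sketch. It assumes pandas 0.25 or newer (where DataFrame.explode is available) and a column that already holds real lists; the vacancy column and the data are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'vacancy': ['a', 'b', 'c'],  # hypothetical identifier column
    'specializations_names': [
        ['Beginner', 'Testing'],
        ['Testing'],
        ['Beginner', 'Testing'],
    ],
})

# One row per (vacancy, name) pair, however many names a row had.
long_df = df.explode('specializations_names')
print(long_df['specializations_names'].value_counts().to_dict())

# Count exact combinations: lists are unhashable, so join them into strings first.
combos = df['specializations_names'].map(', '.join).value_counts()
print(combos.to_dict())
```

The exploded frame answers "how often does each name occur", while the joined strings answer "how often does each combination occur".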
