1

I am following this post to create a number of columns that are true/false based on if a substring is present in another column.

Prior to using the code in the above post, I look at a field called LANGUAGES which has values such as "ENG, SPA, CZE" or "ENG, SPA". Unfortunately, the data is a comma-delimited string instead of a list but no problem, in one line, I can get the list of 25 unique values.

Once I get the list of unique values, I want to make a new column for each value so a df[ENG], df[SPA], etc. columns. I want these columns to be true/false based on if the header is a substring of the original LANGUAGES column.

Following the post, I use df.apply(lambda x: language in df.LANGUAGES, axis = 1). However, when I check the values of the columns (the value counts in the last for loop), all the values come up false.

How can I create a true/false column based on the column's header being a substring of another column?

My Code:

import json
import pandas as pd
import requests

url  = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="

response = requests.get(url)

if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        print(language)
        df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccesful: {0}".format(response.status_code))

Edit: There was a request for a raw data input and expected output. Here is what a column looks like:

+-------+-----------------+
| Index |    LANGUAGES    |
+-------+-----------------+
|     0 | 'ENG, OTH, RUS' |
|     1 | 'ENG'           |
|     2 | 'ENG, CZE, SPA' |
+-------+-----------------+

This is the expected output:

+-------+-----------------+------+-------+-------+-------+-------+
| Index |    LANGUAGES    | ENG  |  CZE  |  OTH  |  RUS  |  SPA  |
+-------+-----------------+------+-------+-------+-------+-------+
|     0 | 'ENG, OTH, RUS' | TRUE | FALSE | TRUE  | TRUE  | FALSE |
|     1 | 'ENG'           | TRUE | FALSE | FALSE | FALSE | FALSE |
|     2 | 'ENG, CZE, SPA' | TRUE | TRUE  | FALSE | TRUE  | FALSE |
+-------+-----------------+------+-------+-------+-------+-------+
0

2 Answers 2

2

Two steps,

first, we explode your list and create a pivot table to re-concat to your original df based on the index.

s  = df['LANGUAGES'].str.replace("'",'').str.split(',').explode().to_frame()

cols = s['LANGUAGES'].drop_duplicates(keep='first').tolist()

df2 = pd.concat([df, pd.crosstab(s.index, s["LANGUAGES"])[cols]], axis=1).replace(
    {1: True, 0: False}
)
print(df2)
         LANGUAGES   ENG    OTH    RUS    CZE    SPA
0  'ENG, OTH, RUS'  True   True   True  False  False
1            'ENG'  True  False  False  False  False
2  'ENG, CZE, SPA'  True  False  False   True   True
Sign up to request clarification or add additional context in comments.

Comments

0

Found in this post, I swapped out the following line of code:

df[language] = df.apply(lambda x: language in df.LANGUAGES, axis = 1)

for the following two lines:

    criteria = lambda row : language in row["LANGUAGES"]
    df[language] = df.apply(criteria, axis =1)

And it works.

import json
import pandas as pd
import requests

url  = r"https://data.hud.gov/Housing_Counselor/search?AgencyName=&City=&State=&RowLimit=&Services=&Languages="

response = requests.get(url)

if response.status_code == 200:
    res = response.json()
    df = pd.DataFrame(res)
    df.columns = [str(h).upper() for h in list(df)]
    #
    # the below line is confusing but it creates a sorted list of all unique languages
    #
    languages = [str(s) for s in sorted(list(set((",".join(list(df["LANGUAGES"].unique()))).split(","))))]
    for language in languages:
        criteria = lambda row : language in row["LANGUAGES"]
        df[language] = df.apply(criteria, axis =1)
    for language in languages:
        print(df[language].value_counts())
        print("\n")
else:
    print("\nConnection was unsuccesful: {0}".format(response.status_code))

This line swap could also work:

for language in languages:
    df[language] = df.LANGUAGES.apply(lambda x: 'True' if language in x else 'False')
    print("{}:{}".format(language, df[df[language] == 'True'].shape[0]))

1 Comment

best not to use for loops with pandas, it's against the api - use vectorised methods that take advantage of the underlying C /Cython code.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.