1

I'm working through some beginner ML code, and to find the unique values in a column, the author uses this function:

def unique_vals(rows, col):
    """Find the unique values for a column in a dataset."""
    return set([row[col] for row in rows])

I am working with a DataFrame, however, and for me this code returns single letters: 'm', 'l', etc. I tried altering it to:

set(row[row[col] for row in rows)

But then it returns:

KeyError: "None of [Index(['Apple', 'Banana', 'Grape'   dtype='object', length=2318)] are in the [columns]"

Thanks for your time!

2 Answers

5

In general, you don't need to do such things yourself because pandas already does them for you.

In this case, what you want is the unique method, which you can call directly on a Series (pd.Series is the abstraction that represents, among other things, columns). It returns a NumPy array containing the unique values in that Series.
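As a minimal sketch of why the original function returns single letters and what unique gives instead: iterating over a DataFrame yields its column labels rather than its rows, so indexing each "row" with an integer picks one character out of a label. The DataFrame and its column names below are hypothetical, loosely modelled on the fruit labels in the error message.

```python
import pandas as pd

# Hypothetical data; the real DataFrame in the question is not shown.
df = pd.DataFrame({
    "color": ["Green", "Yellow", "Red", "Red"],
    "label": ["Apple", "Banana", "Grape", "Grape"],
})

# Iterating over a DataFrame yields column labels, not rows.
labels = [row for row in df]

# Indexing each label with an integer returns one character of the label,
# which is where the single letters come from.
letters = set(row[2] for row in df)

# Series.unique() returns the distinct values in order of first appearance.
unique_labels = df["label"].unique()

print(labels)
print(letters)
print(unique_labels)
```

Here the letters are the third characters of "color" and "label", not values from the data.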

If you want the unique values for multiple columns, you can do something like this:

which_columns = ... # specify the columns whose unique values you want here

uniques = {col: df[col].unique() for col in which_columns}
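If you would rather keep the tutorial's unique_vals interface, one possible DataFrame version is sketched below. It assumes that col may be either a column label or, as in the list-of-rows tutorial code, an integer position; that dual handling is my assumption, not something the tutorial specifies, and the sample data is hypothetical.

```python
import pandas as pd


def unique_vals(df, col):
    """Return the set of unique values in one column of a DataFrame.

    `col` may be a column label or an integer position (assumption:
    the tutorial passes integer positions for list-of-lists data).
    """
    if isinstance(col, int) and col not in df.columns:
        series = df.iloc[:, col]  # positional lookup
    else:
        series = df[col]          # label lookup
    return set(series.unique())


# Hypothetical data for illustration only.
df = pd.DataFrame({
    "color": ["Green", "Yellow", "Red", "Red"],
    "label": ["Apple", "Banana", "Grape", "Grape"],
})

by_position = unique_vals(df, 1)
by_label = unique_vals(df, "color")
print(by_position)
print(by_label)
```

Returning a set matches the original function, at the cost of losing the order in which unique reports the values.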

2 Comments

You can also leverage the fact that NumPy operates on an entire array at once and do {*np.unique(df[which_columns].values)}
@piRSquared I believe that would only work if the columns were homogeneous...?
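To illustrate the difference discussed in these comments, here is a small sketch with hypothetical data. np.unique flattens the selected columns into one pool of values, so the result no longer says which column a value came from. It also sorts the values, so, as far as I know, columns mixing types that cannot be compared (for example strings and integers) can make it raise; the example sticks to string columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two string columns.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "owner": ["Champ", "Ron", "Brick"],
})
which_columns = ["pets", "owner"]

# One pooled set across all selected columns.
pooled = {*np.unique(df[which_columns].values)}

# One set per column, keeping the column association.
per_column = {col: set(df[col].unique()) for col in which_columns}

print(pooled)
print(per_column)
```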
3

If you are working with categorical columns, the following code is useful. It prints not only the unique values but also the count of each unique value:

categorical_columns = ['col1', 'col2', 'col3', ..., 'coln']  # placeholder: your column names

# Print the frequency of each category, one column at a time
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())

Example:

df

     pets     location     owner
0     cat    San_Diego     Champ
1     dog     New_York       Ron
2     cat     New_York     Brick
3  monkey    San_Diego     Champ
4     dog    San_Diego  Veronica
5     dog     New_York       Ron


categorical_columns = ['pets','owner','location']
#Print frequency of categories
for col in categorical_columns:
    print('\nFrequency of Categories for variable %s' % col)
    print(df[col].value_counts())

Output:

# Frequency of Categories for variable pets
# dog       3
# cat       2
# monkey    1
# Name: pets, dtype: int64

# Frequency of Categories for variable owner
# Champ       2
# Ron         2
# Brick       1
# Veronica    1
# Name: owner, dtype: int64

# Frequency of Categories for variable location
# New_York     3
# San_Diego    3
# Name: location, dtype: int64

