
I have a data frame where most of the columns are of varchar/object type. The length of the values varies a lot and could be anywhere from 3 to 1000+ characters. Now, for each column, I want to measure the maximum length.

I know how to calculate the maximum length for a single column. If it's varchar:

max(df.char_col.apply(len))

and if it's numeric (e.g. float64 or int64):

max(df.num_col.map(str).apply(len))

But my dataframe has hundreds of columns, and I want to calculate the maximum length for all of them at once. The problem is that the columns have different data types, and I don't know how to handle them all in one pass.

So Question 1: How to get the maximum column length for each column in the data frame

Now I am trying to do that only for varchar/object type columns using following code:

xx = df.select_dtypes(include=['object'])
for col in [xx.columns.values]:
    maxlength = [max(xx.col.apply(len))]

I selected only the object type columns and tried to write a for loop, but it's not working. Probably using apply() inside a for loop is not a good idea.

Question 2: How to get the maximum length of each column for only the object type columns

Sample data frame:

d1 = {'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA'], 'comment1': ['Very good performance', 'N/A', 'Need to work hard', 'No Comment', 'Not satisfactory'], 'comment2': ['good', 'Meets Expectation', 'N', 'N/A', 'Incompetence']}
df1 = pd.DataFrame(data = d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year

7 Answers


You can use max / min after the str.len() method:

df["A"].str.len().max()
df["A"].str.len().min()

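If you want this measurement for every object column at once rather than one at a time, the same str.len() idea combines with select_dtypes and apply. A minimal sketch using part of the question's sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['john', 'tom', 'bob', 'rock', 'jimy'],
                    'comment1': ['Very good performance', 'N/A', 'Need to work hard',
                                 'No Comment', 'Not satisfactory']})

# .str.len() works column by column, so apply it to each object column
# and take the max of each resulting Series
max_lens = df1.select_dtypes(include='object').apply(lambda s: s.str.len().max())
print(max_lens)  # name -> 4, comment1 -> 21
```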

3 Comments

Odd how the simplest, cleanest answer has only one upvote compared to 24. Anyways, this answer works perfectly for me.
Thanks Dan, much appreciated :)
Simple and clean!

One solution is to use numpy.vectorize. This may be more efficient than pandas-based solutions.

You can use pd.DataFrame.select_dtypes to select object columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                   'B': ['a', 'abcde', 'abc'],
                   'C': [1, 2.5, 1.5]})

measurer = np.vectorize(len)

Max length for all columns

res1 = measurer(df.values.astype(str)).max(axis=0)

array([4, 5, 3])

Max length for object columns

res2 = measurer(df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)

array([4, 5])

Or if you need output as a dictionary:

res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5, 'C': 3}

df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5}

2 Comments

Does not work well for large data sets. Failed with a warning numpy.core._exceptions.MemoryError: Unable to allocate 4.18 TiB for an array with shape (4130207, 5) and data type <U55682
@Harvey posted below a very valid answer: for column in df: print(column,"->", df[column].astype(str).str.len().max())

Finding the max number of chars for all columns (of any type) in a dataframe:

for column in df:
    print(column,"->", df[column].astype(str).str.len().max())

This works fairly fast; I'm using it on an 80-million-row df.
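If you'd rather collect the results than print them, the same per-column expression drops into a dict comprehension. A small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'tom', 'bob'],
                   'year': [2010, 2012, 2014]})

# same astype(str).str.len().max() per column, gathered into a dict
max_lens = {col: df[col].astype(str).str.len().max() for col in df}
print(max_lens)  # {'name': 4, 'year': 4}
```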



Some great answers here, and I would like to contribute mine.

Solution:

dict([(v, df[v].apply(lambda r: len(str(r)) if r is not None else 0).max()) for v in df.columns.values])

Explanation:

# convert the list of tuples to a dictionary
dict(
    [
        # create a tuple of (column name, max length of values in column)
        (v, df[v].apply(lambda r: len(str(r)) if r is not None else 0).max())
            for v in df.columns.values  # iterates over all columns
    ])

Sample output

{'name': 4, 'DoB': 10, 'Address': 2, 'comment1': 21, 'comment2': 17}
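One caveat with this approach: in pandas, missing values typically appear as NaN rather than None, so the None check still counts them as the 3-character string 'nan'. A NaN-aware variant (a sketch) could use pd.isna instead:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['john', None, 'rock'],
                   'score': [1.5, np.nan, 10.25]})

# pd.isna catches both None and NaN; count missing values as length 0
max_lens = {c: df[c].apply(lambda r: 0 if pd.isna(r) else len(str(r))).max()
            for c in df.columns}
print(max_lens)  # {'name': 4, 'score': 5}
```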



Select only object type columns

df2 = df1[[x for x in df1 if df1[x].dtype == 'object']]

Get the maximum length in each column

max_length_in_each_col = df2.applymap(lambda x: len(x)).max()

4 Comments

For some reason the df2 statement is pulling in all column types, not just object.
I've changed 'O' to 'object' - it should select object columns now.
Once the correct columns have been selected, xx.map(len).max() is exactly what I was looking for!
Recent pandas versions warn to use .map instead of .applymap. And the whole lambda can be just len.
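Following up on that last comment, a version of this answer for recent pandas might prefer DataFrame.map (added in 2.1, where applymap is deprecated) while still falling back on older versions. A sketch, not the answer's original code:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['john', 'tom', 'bob', 'rock', 'jimy'],
                    'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']})

df2 = df1.select_dtypes(include='object')

# DataFrame.map exists only on pandas >= 2.1; applymap is its older name
mapper = df2.map if hasattr(pd.DataFrame, 'map') else df2.applymap
max_length_in_each_col = mapper(len).max()
print(max_length_in_each_col)  # name -> 4, Address -> 2
```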

I tried numpy.vectorize, but it gave a MemoryError for a huge dataframe.

The code below worked perfectly for me. It gives a list of maximum lengths for each column in an Excel spreadsheet (read into a dataframe using pandas):

import pandas as pd

xl = pd.ExcelFile('sample.xlsx')
df = xl.parse('Sheet1')

maxColumnLengths = []
for col in range(len(df.columns)):
    maxColumnLengths.append(max(df.iloc[:, col].astype(str).apply(len)))
print('Max Column Lengths:', maxColumnLengths)



Both questions can be answered with a dict comprehension:

So Question 1: How to get the maximum column length for each column in the data frame

max_length_all_cols = {col: df.loc[:, col].astype(str).apply(len).max() for col in df.columns}

Question 2: How to get the maximum length of each column for only the object type columns

max_length_object_cols = {col: df.loc[:, col].astype(str).apply(len).max() for col in df.select_dtypes([object]).columns}
