
I have a dataframe with 15 columns (4 categorical and the rest numeric).

I have created dummy variables for every categorical variable. Now I want to find the number of variables in my new dataframe.

I tried calculating the length of printSchema(), but it is NoneType:

print type(df.printSchema())
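
For reference, printSchema() prints the schema to stdout and returns None, so there is nothing to take the length of. A minimal sketch demonstrating this (it assumes a Spark 2+ SparkSession and a small throwaway df):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "A")], ["ID", "TYPE"])

result = df.printSchema()   # prints the schema to stdout as a side effect
print(type(result))         # <class 'NoneType'> -- there is nothing to len()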

2 Comments
  • What have you tried? Have you searched the web? Commented Mar 15, 2017 at 9:17
  • 1
    try to check len(df.columns) Commented Mar 15, 2017 at 9:23

1 Answer


You are going about it the wrong way. Here is a sample example, along with an explanation of printSchema:

df = sqlContext.createDataFrame([
    (1, "A", "X1"),
    (2, "B", "X2"),
    (3, "B", "X3"),
    (1, "B", "X3"),
    (2, "C", "X2"),
    (3, "C", "X2"),
    (1, "C", "X1"),
    (1, "B", "X1"),
], ["ID", "TYPE", "CODE"])


# Python 2:
print len(df.columns)     # 3
# Python 3:
print(len(df.columns))    # 3

columns gives a list of all the column names, so we can take its len. printSchema, by contrast, prints the schema of the df, i.e. the columns and their data types, for example:

root
 |-- ID: long (nullable = true)
 |-- TYPE: string (nullable = true)
 |-- CODE: string (nullable = true)
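
If you need the schema programmatically rather than printed, df.schema returns a StructType; the length of its fields list is another way to count the columns. A small sketch using the df above:

print(df.columns)             # ['ID', 'TYPE', 'CODE'] -- plain list of names
print(len(df.columns))        # 3
print(len(df.schema.fields))  # 3 -- StructField objects carrying name and type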

3 Comments

On the pyspark console, len(df.columns) is enough; print is not needed.
I really hope there's an OOP solution like .length or .size, etc.
What about an RDD? If I have an RDD, not a DataFrame, how do I display the number of columns? @Rakesh Kumar @chuck
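
On the RDD question above: an RDD carries no schema, so there is no built-in column count. One common sketch, assuming every record has the same arity, is to inspect the first record:

rdd = df.rdd             # the DataFrame above as an RDD of Row objects
print(len(rdd.first()))  # 3, assuming all records share the same length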
