df1.printSchema() prints the column names along with their data types.

df1.drop($"colName") will drop columns by their name.

Is there a way to adapt this command to drop columns by their data type instead?

2 Answers

If you are looking to drop specific columns in a dataframe based on their types, the snippet below should help. In this example, I have a dataframe with two columns, of type String and Int respectively. I drop the String field from the schema based on its type (all fields of type String would be dropped).

import sqlContext.implicits._

// Build a toy dataframe: c1 is a String column, c2 is an Int column.
val df = sc.parallelize(('a' to 'j').map(_.toString) zip (1 to 10)).toDF("c1", "c2")

// Collect the names of all string-typed columns, then drop each of them from df.
val newDf = df.schema.fields
  .collect({ case x if x.dataType.typeName == "string" => x.name })
  .foldLeft(df)({ case (dframe, field) => dframe.drop(field) })

The resulting newDf is org.apache.spark.sql.DataFrame = [c2: int]; only the Int column remains.
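To confirm, the question's own printSchema can be run on the result; it should show just the one column:

newDf.printSchema()
// root
//  |-- c2: integer (nullable = false)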

2 Comments

How can I apply this approach to nested columns? It does not work on columns of Struct or Array type.
You will have to unpack the struct and re-create it with all of the fields except the ones you want to drop; see the sketch after these comments. Something similar applies to Array-typed columns as well.
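For the Struct case, here is a minimal sketch of that rebuild. It assumes Spark 2.x and a hypothetical dataframe df with a struct column named s, and drops the string fields nested inside s:

import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{StringType, StructType}

// "s" is a hypothetical struct column; keep only its non-string inner fields.
val innerFields = df.schema("s").dataType.asInstanceOf[StructType].fields
val kept = innerFields.collect { case f if f.dataType != StringType => col(s"s.${f.name}") }

// Re-create the struct from the surviving fields and overwrite the original column.
// (If every inner field were a string, struct() would receive no columns and fail,
// so guard against that in real code.)
val dfCleaned = df.withColumn("s", struct(kept: _*))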

Here is a fancy way to do it in Scala:

// Collect the names of all string-typed ("categorical") columns.
val categoricalFeatColNames = df.schema.fields
  .filter(_.dataType.isInstanceOf[org.apache.spark.sql.types.StringType])
  .map(_.name)
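To actually drop those columns rather than just list them, the collected names can be passed back to drop, which accepts multiple column names in Spark 2.x and later:

val dfWithoutCategoricals = df.drop(categoricalFeatColNames: _*)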
