
I have a dataframe in PySpark. Some of its numerical columns contain nan, so when I read the data and check the schema of the dataframe, those columns show up as string type.

How can I change them to int type? I replaced the nan values with 0 and checked the schema again, but it still shows string type for those columns. I am using the code below:

data_df = sqlContext.read.format("csv").load('data.csv', header=True, inferSchema=True)
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()

My data looks like this (screenshot of the dataframe omitted). The columns Plays and drafts contain integer values, but because of the nan values present in them they are treated as string type.

  • Have a look at stackoverflow.com/questions/32284620/… Commented Oct 26, 2017 at 13:48
  • Is there any way to replace the nan while reading the data, so that those columns are treated as int type in the schema? Commented Oct 26, 2017 at 13:59
  • You will have to make the whole column an integer in your case, I believe. Commented Oct 26, 2017 at 14:01

3 Answers

from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))

You can run a loop over each column, but this is the simplest way to convert a string column into an integer.
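
If you do have many columns, a minimal sketch of that loop (assuming data_df already exists and the column names match your data):

from pyspark.sql.types import IntegerType

# hypothetical list of columns to convert; adjust to your dataframe
int_cols = ["Plays", "drafts"]
for col_name in int_cols:
    data_df = data_df.withColumn(col_name, data_df[col_name].cast(IntegerType()))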


3 Comments

Hi @sahil-desai, it's giving me null values, although printing the schema shows integer. Could you explain why?
@Moi if a value is non-numeric and you cast it, it gets converted to null. What was the previous datatype of your data?
Thank you - I had the cast on the outside of the withColumn and I couldn't understand why it was giving an error.

You could use cast('int') after replacing NaN with 0:

data_df = df.withColumn("Plays", df["Plays"].cast('int'))
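
As a sketch of both steps together (column name assumed): note that fillna(0) with an integer value only touches numeric columns, so in practice the cast has to come first; non-numeric strings such as "nan" become null during the cast and can then be filled with 0.

from pyspark.sql.types import IntegerType

# the cast turns unparseable strings like "nan" into null
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
# fillna(0) now applies, since the column is numeric
data_df = data_df.fillna(0, subset=["Plays"])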



Another way to do it is to use StructField if you have multiple fields that need to be modified.

Ex:

from pyspark.sql.types import StructField, IntegerType, StructType, StringType
newDF = [StructField('CLICK_FLG', IntegerType(), True),
         StructField('OPEN_FLG', IntegerType(), True),
         StructField('I1_GNDR_CODE', StringType(), True),
         StructField('TRW_INCOME_CD_V4', StringType(), True),
         StructField('ASIAN_CD', IntegerType(), True),
         StructField('I1_INDIV_HHLD_STATUS_CODE', IntegerType(), True)
         ]
finalStruct = StructType(fields=newDF)
df = spark.read.csv('ctor.csv', schema=finalStruct)

Output:

Before

root
 |-- CLICK_FLG: string (nullable = true)
 |-- OPEN_FLG: string (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)

After:

root
 |-- CLICK_FLG: integer (nullable = true)
 |-- OPEN_FLG: integer (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)

This is a slightly longer procedure for casting, but the advantage is that all the required fields can be handled in one place.

Note that if only the required fields are assigned a data type in the schema, the resulting dataframe will contain only those fields.
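
If you want to keep every column while overriding only a few types, one sketch (assuming the same ctor.csv and a header row) is to start from the inferred schema and swap in the types you need:

from pyspark.sql.types import IntegerType, StructField, StructType

# read once to infer the full schema, then override selected fields
inferred = spark.read.csv('ctor.csv', header=True, inferSchema=True).schema
overrides = {'CLICK_FLG': IntegerType(), 'OPEN_FLG': IntegerType()}
fields = [StructField(f.name, overrides.get(f.name, f.dataType), f.nullable)
          for f in inferred]
df = spark.read.csv('ctor.csv', header=True, schema=StructType(fields))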

