5

I'm using PySpark v1.6.1 and I want to create a dataframe using another one:

  • Convert a field that has a struct of three values in different columns
  • Convert the timestamp from string to datatime
  • Create more columns using that timestamp
  • Change the rest of the column names and types

Right now is using .map(func) creating an RDD using that function (which transforms from one row from the original type and returns a row with the new one). But this is creating an RDD and I don't wont that.

Is there a nicer way to do this?

3
  • If you are already able to create an RDD, you can easily transform it into DF. Commented Dec 27, 2017 at 14:09
  • but i don't want to create an RDD, i want to avoid using RDDs since they are a performance bottle neck for python, i just want to do DF transformations Commented Dec 27, 2017 at 14:14
  • 1
    Please provide some code of what you've tried so we can help. In the meantime, look up pyspark.sql.functions.udf and the withColumn() method of spark dataframes. Commented Dec 27, 2017 at 15:12

1 Answer 1

4
from pyspark.sql.functions import unix_timestamp, col, to_date, struct

####
#sample data
####
df = sc.parallelize([[25, 'Prem', 'M', '12-21-2006 11:00:05','abc', '1'],
                      [20, 'Kate', 'F', '05-30-2007 10:05:00', 'asdf', '2'],
                      [40, 'Cheng', 'M', '12-30-2017 01:00:01', 'qwerty', '3']]).\
    toDF(["age","name","sex","datetime_in_strFormat","initial_col_name","col_in_strFormat"])

#create 'struct' type column by combining first 3 columns of sample data - (this is built to answer query #1)
df = df.withColumn("struct_col", struct('age', 'name', 'sex')).\
    drop('age', 'name', 'sex')
df.show()
df.printSchema()

####
#query 1
####
#Convert a field that has a struct of three values (i.e. 'struct_col') in different columns (i.e. 'name', 'age' & 'sex')
df = df.withColumn('name', col('struct_col.name')).\
    withColumn('age', col('struct_col.age')).\
    withColumn('sex', col('struct_col.sex')).\
    drop('struct_col')
df.show()
df.printSchema()

####
#query 2
####
#Convert the timestamp from string (i.e. 'datetime_in_strFormat') to datetime (i.e. 'datetime_in_tsFormat')
df = df.withColumn('datetime_in_tsFormat',
                   unix_timestamp(col('datetime_in_strFormat'), 'MM-dd-yyyy hh:mm:ss').cast("timestamp"))
df.show()
df.printSchema()

####
#query 3
####
#create more columns using above timestamp (e.g. fetch date value from timestamp column)
df = df.withColumn('datetime_in_dateFormat', to_date(col('datetime_in_tsFormat')))
df.show()

####
#query 4.a
####
#Change column name (e.g. 'initial_col_name' is renamed to 'new_col_name)
df = df.withColumnRenamed('initial_col_name', 'new_col_name')
df.show()

####
#query 4.b
####
#Change column type (e.g. string type in 'col_in_strFormat' is coverted to double type in 'col_in_doubleFormat')
df = df.withColumn("col_in_doubleFormat", col('col_in_strFormat').cast("double"))
df.show()
df.printSchema()

Sample data:

+---------------------+----------------+----------------+------------+
|datetime_in_strFormat|initial_col_name|col_in_strFormat|  struct_col|
+---------------------+----------------+----------------+------------+
|  12-21-2006 11:00:05|             abc|               1| [25,Prem,M]|
|  05-30-2007 10:05:00|            asdf|               2| [20,Kate,F]|
|  12-30-2017 01:00:01|          qwerty|               3|[40,Cheng,M]|
+---------------------+----------------+----------------+------------+
root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- initial_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- struct_col: struct (nullable = false)
 |    |-- age: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- sex: string (nullable = true)

Final output data:

+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|datetime_in_strFormat|new_col_name|col_in_strFormat| name|age|sex|datetime_in_tsFormat|datetime_in_dateFormat|col_in_doubleFormat|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+
|  12-21-2006 11:00:05|         abc|               1| Prem| 25|  M| 2006-12-21 11:00:05|            2006-12-21|                1.0|
|  05-30-2007 10:05:00|        asdf|               2| Kate| 20|  F| 2007-05-30 10:05:00|            2007-05-30|                2.0|
|  12-30-2017 01:00:01|      qwerty|               3|Cheng| 40|  M| 2017-12-30 01:00:01|            2017-12-30|                3.0|
+---------------------+------------+----------------+-----+---+---+--------------------+----------------------+-------------------+

root
 |-- datetime_in_strFormat: string (nullable = true)
 |-- new_col_name: string (nullable = true)
 |-- col_in_strFormat: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
 |-- sex: string (nullable = true)
 |-- datetime_in_tsFormat: timestamp (nullable = true)
 |-- datetime_in_dateFormat: date (nullable = true)
 |-- col_in_doubleFormat: double (nullable = true)
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.