
I have a DataFrame of the following form:

col1|col2|col3|col4
xxxx|yyyy|zzzz|[1111],[2222]

I want the output to be of the following form:

col1|col2|col3|col4|col5
xxxx|yyyy|zzzz|1111|2222

My col4 is an array, and I want to split it into separate columns. How can this be done?

I saw many answers using flatMap, but they increase the number of rows. I want the tuple's elements placed in separate columns of the same row.

The following is my current schema:

root
 |-- PRIVATE_IP: string (nullable = true)
 |-- PRIVATE_PORT: integer (nullable = true)
 |-- DESTINATION_IP: string (nullable = true)
 |-- DESTINATION_PORT: integer (nullable = true)
 |-- collect_set(TIMESTAMP): array (nullable = true)
 |    |-- element: string (containsNull = true)

Also, could someone please explain the difference between DataFrames and RDDs?

5 Comments
  • What's the schema of your data frame? Can you show df.printSchema()? Commented Jul 22, 2017 at 13:26
  • Hi, I edited the question with my actual schema Commented Jul 22, 2017 at 13:41
  • Do all cells in the array column have the same number of elements? Always 2? What if another row has three elements in the array? Commented Jul 22, 2017 at 13:47
  • No, all the arrays have exactly 2 elements, because they hold a start date and an end date. Commented Jul 22, 2017 at 13:56
  • Also, this is my actual requirement, if you can help me with it: stackoverflow.com/questions/45252906/… Commented Jul 22, 2017 at 14:00

2 Answers


Create sample data:

from pyspark.sql import Row
x = [Row(col1="xx", col2="yy", col3="zz", col4=[123, 234])]
rdd = sc.parallelize(x)
df = spark.createDataFrame(rdd)
df.show()
#+----+----+----+----------+
#|col1|col2|col3|      col4|
#+----+----+----+----------+
#|  xx|  yy|  zz|[123, 234]|
#+----+----+----+----------+

Use getItem to extract an element from the array column by position; in your actual case, replace col4 with collect_set(TIMESTAMP):

df = df.withColumn("col5", df["col4"].getItem(1)).withColumn("col4", df["col4"].getItem(0))
df.show()
#+----+----+----+----+----+
#|col1|col2|col3|col4|col5|
#+----+----+----+----+----+
#|  xx|  yy|  zz| 123| 234|
#+----+----+----+----+----+

5 Comments

@Lydia please be extremely careful and be sure you know what you are doing when altering code: your edit ruined a perfectly good answer, causing it to throw an exception (I restored the OP's original)...
Do you have a way to generalize the iteration over the original col4's array?
@Amesys Did you try destructuring a list comprehension?
I have a follow-up question, dropping the link, thanks in advance! stackoverflow.com/questions/61823544/… @Psidom
What is an efficient way to apply this to more than 10 columns, where each column's list has only one item?

You have 4 options for extracting a value from the array:

df = spark.createDataFrame([[1, [10, 20, 30, 40]]], ['A', 'B'])
df.show()

+---+----------------+
|  A|               B|
+---+----------------+
|  1|[10, 20, 30, 40]|
+---+----------------+

from pyspark.sql import functions as F

df.select(
    "A",
    df.B[0].alias("B0"), # dot notation and index        
    F.col("B")[1].alias("B1"), # function col and index
    df.B.getItem(2).alias("B2"), # dot notation and method getItem
    F.col("B").getItem(3).alias("B3"), # function col and method getItem
).show()

+---+---+---+---+---+
|  A| B0| B1| B2| B3|
+---+---+---+---+---+
|  1| 10| 20| 30| 40|
+---+---+---+---+---+

If the array produces many columns, use a list comprehension:

df.select(
    'A', *[F.col('B')[i].alias(f'B{i}') for i in range(4)]
).show()

+---+---+---+---+---+
|  A| B0| B1| B2| B3|
+---+---+---+---+---+
|  1| 10| 20| 30| 40|
+---+---+---+---+---+
