0

I have a below result of Dataset.

1.

+------+---------+--------+
| Col1 |  Col2   | NumCol |
+------+---------+--------+
| abc  | jun2016 |     25 |
| aac  | jun2017 |     28 |
| aac  | dec2017 |     30 |
| aac  | apr2018 |     45 |
+------+---------+--------+

when sorting is applied I get below result.

+------+---------+--------+
| Col1 |  Col2   | NumCol |
+------+---------+--------+
| aac  | apr2018 |     45 |
| aac  | dec2017 |     30 |
| aac  | jun2017 |     28 |
| abc  | jun2018 |     25 |
+------+---------+--------+

But instead should have been

+------+---------+--------+
| Col1 |  Col2   | NumCol |
+------+---------+--------+
| aac  | jun2017 |     28 |
| aac  | dec2017 |     30 |
| aac  | apr2018 |     45 |
| abc  | jun2018 |     25 |
+------+---------+--------+

According to chronological order. How will I be able to achieve the same.

2. When Week is present

When I have a column as Week as below

+------+-----------------------+--------+
| Col1 |         Week          | NumCol |
+------+-----------------------+--------+
| aac  | 02/04/2018-02/10/2018 |     45 |
| aac  | 02/11/2018-02/17/2018 |     25 |
| aac  | 01/28/2018-02/03/2018 |     30 |
+------+-----------------------+--------+

I want that to get sorted as below.

+------+-----------------------+--------+
| Col1 |         Week          | NumCol |
+------+-----------------------+--------+
| aac  | 01/28/2018-02/03/2018 |     30 |
| aac  | 02/04/2018-02/10/2018 |     45 |
| aac  | 02/11/2018-02/17/2018 |     25 |
+------+-----------------------+--------+

Here above I want to parse date of the column week as new Column dateweek , then sort the Week column and delete before returning the dataset.

Kind of challenging stuff for me.

for #1 I followed this But the issue with this is if there's suppose Jan2016,Feb2016,Jan2017 it gets sorted as Jan2016,Jan2017,Feb2016.

Need help for 2

1
  • are they separate problems or the same? Commented Jun 27, 2018 at 13:10

1 Answer 1

1

split the week and sort based on date.

import org.apache.spark.sql.functions.split
df.withColumn("_tmp", split($"Week", "-")).select($"Col1", $"Week", $"NumCCol1", $"_tmp".getItem(0).as("_sort")).sort("_sort").drop("_sort").show()

output-

+----+---------------------+--------+
|Col1|Week                 |NumCCol1|
+----+---------------------+--------+
|aac |01/28/2018-02/03/2018|30      |
|aac |02/04/2018-02/10/2018|45      |
|aac |02/11/2018-02/17/2018|25      |
+----+---------------------+--------+
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.