
I have a large data set with two columns, and I am analysing it with Spark through the pyspark module. I want to draw a line chart using the "date" column and the "count" column. The date column covers four years, but the rows are not in day-by-day order; the dates are mixed. So first I want to rearrange the dates from past to present. The date column's data type is string. To draw this time series line chart, does the date column have to be converted to a date type? If so, how do I convert these string date values into date type values?


  • Is your question really about drawing a line or about converting a string to date? I understood you are asking about the latter. Commented Apr 18, 2020 at 17:26
  • Mainly I want to draw a line chart, but before drawing it I want to put the date axis in order. Commented Apr 18, 2020 at 17:41
  • OK. So, given you have your dataset right, you still don't know how to draw a line chart. Is that correct? Commented Apr 18, 2020 at 17:49
  • I'm new to Spark and don't know how to draw a line chart with it. I plan to convert the data frame into a pandas data frame and then plot these two variables with matplotlib, but before drawing I want to sort the date column chronologically. Commented Apr 18, 2020 at 17:55
  • Yes, you are correct. Before drawing the chart, you need to convert the Spark DataFrame into some Python data structure, either by converting it to a pandas DataFrame or by "collecting" the Spark DataFrame. Commented Apr 18, 2020 at 18:01

1 Answer


Using Spark 2.4.3, you can convert your string dates like this:

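from pyspark.sql import SparkSession
import pyspark.sql.functions as sf

# Create (or reuse) the session; without this, `sparksession` is undefined.
sparksession = SparkSession.builder.getOrCreate()

df = sparksession.createDataFrame(
    [("8 October 2018", 4407), ("17 September 2017", 13326)],
    ["date", "count"],
)
df.show()

# Parse the string column into a date column, then sort chronologically.
df.select(
    sf.to_date("date", "d MMMMM yyyy").alias("new_date"), "date", "count"
).orderBy("new_date").show()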

And these are the results:

+-----------------+-----+
|             date|count|
+-----------------+-----+
|   8 October 2018| 4407|
|17 September 2017|13326|
+-----------------+-----+

+----------+-----------------+-----+
|  new_date|             date|count|
+----------+-----------------+-----+
|2017-09-17|17 September 2017|13326|
|2018-10-08|   8 October 2018| 4407|
+----------+-----------------+-----+

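PS: In Spark 3.0.0 the handling of date patterns changed. There, the conversion should use the pattern "d MMMM yyyy" (one less M), as described in the Spark documentation on datetime patterns. With the five-letter pattern, Spark 3.0.0 returns null values in new_date.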
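To see why the string column has to be parsed before sorting, here is a minimal plain-Python sketch (no Spark involved). It uses the two rows from the example plus a third row that is made up for illustration, and Python's own strptime pattern "%d %B %Y", which is not a Spark pattern. It assumes an English locale for the month names.

```python
from datetime import datetime

rows = [
    ("8 October 2018", 4407),
    ("17 September 2017", 13326),
    ("2 January 2019", 100),  # hypothetical row, added only for illustration
]

# Sorting the raw strings compares characters, not calendar dates.
by_string = sorted(rows, key=lambda r: r[0])

# Parsing first gives real dates, so the sort is chronological.
by_date = sorted(rows, key=lambda r: datetime.strptime(r[0], "%d %B %Y"))

print([r[0] for r in by_string])
print([r[0] for r in by_date])
```

The string sort places "2 January 2019" before "8 October 2018", while the parsed sort runs from past to present. This is the same reason the Spark code orders by new_date rather than by date.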

Chart

To draw a line chart, you could use Pandas and matplotlib:

pdf = (
    df.select(
        sf.to_date("date", "d MMMMM yyyy").alias("new_date"),
        "date",
        "count",
    )
    .orderBy("new_date")
    .toPandas()
)

pdf.plot.line(x="new_date", y="count")
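If the date conversion is easier to do on the pandas side, the same idea can be sketched without Spark. This is only an illustration under assumptions: the small DataFrame below stands in for the result of toPandas(), the "%d %B %Y" pattern is pandas/Python syntax (not a Spark pattern) and assumes English month names, and the Agg backend is chosen only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import pandas as pd

# Stand-in for the output of .toPandas(); values copied from the example.
pdf = pd.DataFrame(
    {
        "date": ["8 October 2018", "17 September 2017"],
        "count": [4407, 13326],
    }
)

# Parse the strings into datetime64 values, then sort from past to present.
pdf["new_date"] = pd.to_datetime(pdf["date"], format="%d %B %Y")
pdf = pdf.sort_values("new_date").reset_index(drop=True)

# Draw the line chart with the parsed dates on the x axis.
ax = pdf.plot.line(x="new_date", y="count", marker="o")
ax.set_xlabel("date")
ax.set_ylabel("count")
fig = ax.get_figure()
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
# fig.savefig("counts_over_time.png")  # hypothetical file name
```

For a data set that does not fit in driver memory, it is usually safer to aggregate or filter in Spark first and convert only the reduced result to pandas.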

10 Comments

@bochat107 I followed your steps, but my new_date column was not created like yours. It contains null values, although the column's type was converted to date. I attached a screenshot to the question.
Hmm... Which Spark version are you using?
The Spark version is 3.0.0-preview2.
As I suspected, the date format seems to have changed for newer versions of Spark. Can you try "d MMMM yyyy" (one less M)?
1) What does "NameError: name 'sparksession' is not defined" mean, and how do I resolve it?