I have a big data set with two columns, and I use Spark with the pyspark module to analyze it. I am trying to draw a line chart using the "date" column and the "count" column. The date column covers 4 years, but the rows are not in chronological (day by day) order; the dates are mixed. So first I want to rearrange the dates from past to present. The date column's data type is string. To draw this time series line chart, does the date column have to be converted to a date type? If so, how do I convert these string date values into date type values?
- Is your question really about drawing a line chart, or about converting a string to a date? I understood you are asking about the latter. – boechat107, Apr 18, 2020 at 17:26
- Mainly I want to draw a line chart, but before drawing it I want to put the date axis in order. – randunu galhena, Apr 18, 2020 at 17:41
- OK. So, given you have your dataset right, you still don't know how to draw a line chart. Is that correct? – boechat107, Apr 18, 2020 at 17:49
- I'm new to Spark and don't know how to draw a line chart with it. I plan to draw the chart for these two variables by converting the DataFrame into a pandas DataFrame and then using matplotlib. But before drawing, I want to rearrange the date column into chronological order. – randunu galhena, Apr 18, 2020 at 17:55
- Yes, you are correct. Before drawing the chart, you need to convert the Spark DataFrame into some Python data structure. It could be a Pandas DF, or the result of "collecting" the Spark DataFrame. – boechat107, Apr 18, 2020 at 18:01
1 Answer
Using Spark 2.4.3, you can convert your string dates like this:
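import pyspark.sql.functions as sf
from pyspark.sql import SparkSession

# Create (or reuse) a session; "sparksession" is just the variable name used here.
sparksession = SparkSession.builder.getOrCreate()

df = sparksession.createDataFrame(
    [("8 October 2018", 4407), ("17 September 2017", 13326)],
    ["date", "count"],
)
df.show()
# Parse the string column into a date column, then sort chronologically.
df.select(
    sf.to_date("date", "d MMMMM yyyy").alias("new_date"), "date", "count"
).orderBy("new_date").show()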
And these are the results:
+-----------------+-----+
| date|count|
+-----------------+-----+
| 8 October 2018| 4407|
|17 September 2017|13326|
+-----------------+-----+
+----------+-----------------+-----+
| new_date| date|count|
+----------+-----------------+-----+
|2017-09-17|17 September 2017|13326|
|2018-10-08| 8 October 2018| 4407|
+----------+-----------------+-----+
PS: In Spark 3.0.0, the date pattern syntax has changed. The date conversion should use the pattern "d MMMM yyyy" (one fewer M), as documented here.
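If the same script has to run on both Spark versions, one option is to pick the pattern from the version string. This is only a minimal sketch: `date_pattern_for` is a hypothetical helper, not part of Spark, and it assumes the pattern split described above (five M's before 3.0, four from 3.0 on) holds for every 2.x and 3.x release, which has only been reported here for 2.4.3 and 3.0.0.

```python
def date_pattern_for(spark_version: str) -> str:
    """Return the month-name date pattern for a given Spark version string.

    Hypothetical helper: assumes versions look like "2.4.3" or "3.0.0-preview2"
    and that only the major version decides which pattern is needed.
    """
    major = int(spark_version.split(".")[0])
    return "d MMMM yyyy" if major >= 3 else "d MMMMM yyyy"


print(date_pattern_for("2.4.3"))
print(date_pattern_for("3.0.0-preview2"))
```

With a running session, the result could be passed straight to the conversion, for example sf.to_date("date", date_pattern_for(sparksession.version)).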
Chart
To draw a line chart, you could use Pandas and matplotlib:
pdf = (
    df.select(
        sf.to_date("date", "d MMMMM yyyy").alias("new_date"),
        "date",
        "count",
    )
    .orderBy("new_date")
    .toPandas()
)
pdf.plot.line(x="new_date", y="count")
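The question comments also mention "collecting" the Spark DataFrame instead of going through Pandas. A rough sketch of that route in plain Python follows. The `rows` list is a hypothetical stand-in for what df.collect() would return, `parse_day` is an illustrative helper, and the `%B` directive assumes English month names in the current locale.

```python
from datetime import datetime

# Hypothetical stand-in for the (date, count) pairs returned by df.collect().
rows = [("8 October 2018", 4407), ("17 September 2017", 13326)]


def parse_day(text):
    # "%d %B %Y" matches strings such as "8 October 2018".
    # Assumption: the locale uses English month names.
    return datetime.strptime(text, "%d %B %Y").date()


# Sort by the parsed date rather than by the raw string.
ordered = sorted(rows, key=lambda row: parse_day(row[0]))
xs = [parse_day(day) for day, _ in ordered]
ys = [count for _, count in ordered]
```

The two lists can then be handed to matplotlib, for example with plt.plot(xs, ys) after import matplotlib.pyplot as plt. For a data set spanning four years this is only reasonable if the collected rows fit comfortably in driver memory.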
10 Comments
randunu galhena
@boechat107 I followed your steps, but my new_date column was not created like yours. It contains null values, although the column's type was converted to date. I attached a screenshot in the question area.
boechat107
Hum... Which Spark version are you using?
randunu galhena
The Spark version is 3.0.0-preview2.
boechat107
As I suspected, the date format seems to have changed for newer versions of Spark. Can you try "d MMMM yyyy" (one less M)?
randunu galhena
1) What is the meaning of "NameError: name 'sparksession' is not defined"? How do I resolve this issue?


