0

I have the data below and need to create a line chart of x = Date and y = count.

The code I used to create the dataframe below was from another dataframe.

df7=df7.select("*",
concat(col("Month"),lit("/"),col("Year")).alias("Date"))
df7.show()

enter image description here

I've imported matplotlib.pyplot as plt and am still getting errors.

The code to plot I used in different variations as below:

df.plot(x = 'Date', y = 'Count')

df.plot(kind = 'line')

I keep getting this error though:

AttributeError: 'DataFrame' object has no attribute 'plt'/'plot'

6
  • I tried plot as well and get the same error. AttributeError: 'DataFrame' object has no attribute 'plot' Commented Nov 10, 2022 at 19:13
  • Please create a minimal reproducible example. Include code, error, and the data as text. Commented Nov 10, 2022 at 19:13
  • Unfortunately, when I copy and paste the table from pandas the format never follows. Commented Nov 10, 2022 at 19:18
  • How to make good reproducible pandas examples may be helpful. Commented Nov 10, 2022 at 19:19
  • you need to convert the dataframe from spark dataframe to pandas dataframe. try df_pd= df.toPandas() and then run the plot on df_pd. Commented Nov 10, 2022 at 19:36

1 Answer 1

1

Please note that using df_pd= df.toPandas() is sometimes expensive, and if you deal with a high number of records like a scale of M, you might face OOM error in Databricks medium or your session could be crashed due to a lack of RAM memory of the drive. Long story short, by using toPandas(), in fact, you are not using spark-based or distributed computation resources anymore! So alternatively, you can follow below approach: So let's start with a simple example:

import time
import datetime as dt
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.sql.functions import dayofmonth, dayofweek
from pyspark.sql.types import StructType,StructField, StringType, IntegerType, TimestampType, DateType

dict2  = [("2021-08-11 04:05:06", 10),
         ("2021-08-12 04:15:06", 17),
         ("2021-08-13 09:15:26", 25),
         ("2021-08-14 11:04:06", 68),
         ("2021-08-15 14:55:16", 50),
         ("2021-08-16 04:12:11", 2),

  ]

schema = StructType([ 

    StructField("timestamp",     StringType(),    True), \
    StructField("count",         IntegerType(),    True), \
  ])
 
#create a Spark dataframe
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(data=dict2,schema=schema)
sdf.printSchema()
sdf.show(truncate=False)

#Generate date and timestamp
new_df = sdf.withColumn('timestamp',    F.to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss").cast(TimestampType())) \
            .withColumn('date',         F.to_date("timestamp",      "yyyy-MM-dd").cast(DateType())) \
            .select('timestamp', 'date', 'count') 

new_df.show(truncate = False)  

#root
# |-- timestamp: string (nullable = true)
# |-- count: integer (nullable = true)

#+-------------------+-----+
#|timestamp          |count|
#+-------------------+-----+
#|2021-08-11 04:05:06|10   |
#|2021-08-12 04:15:06|17   |
#|2021-08-13 09:15:26|25   |
#|2021-08-14 11:04:06|68   |
#|2021-08-15 14:55:16|50   |
#|2021-08-16 04:12:11|2    |
#+-------------------+-----+

#+-------------------+----------+-----+
#|timestamp          |date      |count|
#+-------------------+----------+-----+
#|2021-08-11 04:05:06|2021-08-11|10   |
#|2021-08-12 04:15:06|2021-08-12|17   |
#|2021-08-13 09:15:26|2021-08-13|25   |
#|2021-08-14 11:04:06|2021-08-14|68   |
#|2021-08-15 14:55:16|2021-08-15|50   |
#|2021-08-16 04:12:11|2021-08-16|2    |
#+-------------------+----------+-----+

Now you need to collect() the values of the columns you want to reflect your plot in the absence of Pandas; of course, this is expensive and takes (a long) time in big data records, but it works. Now you can apply one of the following ways:

#for big\high # of records
xlabels = new_df.select("timestamp").rdd.flatMap(list).collect()
ylabels = new_df.select("count").rdd.flatMap(list).collect()

#for limited # of records
xlabels   = [val.timestamp  for val in new_df.select('timestamp').collect()]
ylabels   = [val.count      for val in new_df.select('count').collect()]

To plot:

import matplotlib.pyplot as plt
import matplotlib.dates as md

fig, ax = plt.subplots(figsize=(10,6))
plt.plot(xlabels,     ylabels,   color='blue',    label="event's count") #, marker="o"
plt.scatter(xlabels,  ylabels,   color='cyan', marker='d', s=70)

plt.xticks(rotation=45)
plt.ylabel('Event counts \n# of records', fontsize=15)
plt.xlabel('timestamp', fontsize=15)
plt.title('Events over time', fontsize=15, color='darkred', weight='bold')
plt.legend(['# of records'], loc='upper right')

plt.show()

img

Based on comments, I assumed due to having lots of records that are printed under x-axis timestamps are not readable like the below pic:

img

To resolve this, you need to use the following approach to arrange x-axis ticks properly so that they would not plot on top of each other or ultimately side-by-side:

import pandas as pd
import matplotlib.pyplot as plt

x=xlabels
y=ylabels

#Note 1: if you use Pandas dataFrame after .toPandas()
#x=df['timestamp']
#y=df['count']

##Note 2: if you use Pandas dataFrame after .toPandas()
# convert the datetime column to a datetime type and assign it back to the column
#df.timestamp = pd.to_datetime(df.timestamp)
#verify timestamp column type is datetime64[ns] using print(df.info()) 

fig, ax = plt.subplots( figsize=(12,8))

plt.plot(x, y)

ax.legend(['# of records'])
ax.set_xlabel('Timestamp')
ax.set_ylabel('Event counts \n# of records')

# beautify the x-labels
import matplotlib.dates as md

plt.gcf().autofmt_xdate()
myFmt = md.DateFormatter('%Y-%m-%d %H:%M:%S.%f')
plt.gca().xaxis.set_major_formatter(myFmt)

plt.show()
plt.close()

img

Sign up to request clarification or add additional context in comments.

1 Comment

Very helpful. Thanks a lot!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.