I'm trying to use gradient descent on a data set. What I have written is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('C:/Users/Teacher/Downloads/data.csv')
X = data.iloc[:, 0]  # all rows of the first column
Y = data.iloc[:, 1]  # all rows of the second column
plt.scatter(X, Y)
plt.show()

n = len(X)
a = 0      # slope
b = 0      # intercept
L = 0.001  # learning rate
for i in range(1000):
    y_predicted = a * X + b
    pd_a = (1 / n) * sum((y_predicted - Y) * X)  # partial derivative w.r.t. a
    pd_b = (1 / n) * sum(y_predicted - Y)        # partial derivative w.r.t. b
    a = a - L * pd_a
    b = b - L * pd_b
print(a, b)

plt.scatter(X, Y)
c, d = np.polyfit(X, Y, 1)  # least-squares fit for comparison
print(c, d)
xs = [min(X), max(X)]
plt.plot(xs, [a * x + b for x in xs])  # line from gradient descent
plt.plot(xs, [c * x + d for x in xs])  # line from np.polyfit
plt.show()
If I instead define X = np.random.rand(20) and Y = np.random.rand(20), everything seems to work fine, so the issue appears to be with the input from the CSV.
However, the scatterplot of X and Y is still fine even when I define them as the first and second columns of my data set, so I'm not sure what's going on.
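In case it is relevant, here is a quick sanity check on the columns (standard pandas calls; one known way for gradient descent with a fixed learning rate to blow up is X values much larger than the random data in [0, 1)):

import pandas as pd

data = pd.read_csv('C:/Users/Teacher/Downloads/data.csv')
print(data.dtypes)        # both columns should come out numeric (int64/float64)
print(data.isna().sum())  # number of missing values per column
print(data.describe())    # value ranges of each column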
Edit: Here is an image of the scatterplot after defining X = data.iloc[:, 0] and Y = data.iloc[:, 1]:
Here is an image of the plot and line at the end of the code.
The result of print(data.head()):
Edit: reading just one line of the CSV:
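For reference, one way to read a single row is pandas' nrows parameter:

print(pd.read_csv('C:/Users/Teacher/Downloads/data.csv', nrows=1))  # nrows=1 loads only the first data row after the header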