I am trying the following code:

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
model = linear_model.LogisticRegression()
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

X=scaler.fit_transform(X)

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

model.fit(X_train,y_train)
# Make predictions using the testing set
powerOutput_y_pred = model.predict(X_test)
print (powerOutput_y_pred)
# The coefficients
print('Coefficients: \n', model.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(y_test, powerOutput_y_pred))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % r2_score(y_test, powerOutput_y_pred))

plt.scatter(X_test, y_test,  color='black')
plt.plot(X_test, powerOutput_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

But I am getting the following error from the scatter plot:

ValueError: x and y must be the same size

If I run df.head(), I get the following table:

[Image: df structure]

The features X and y are as below:

X=df.values[:,[0,1,2,3,4,5,7]]
y=df.values[:,6]

Running X.shape gives (25, 7) and y.shape gives (25,). How can I fix this shape mismatch?

  • What do you expect the scatter plot to look like? That is, what relationship are you attempting to plot? Commented Apr 8, 2018 at 5:34
  • I am trying to follow this: scikit-learn.org/stable/auto_examples/linear_model/… But that example uses only one feature. I am trying to use 7 features in X. Commented Apr 8, 2018 at 5:49
  • The best answer turned out to be even simpler than I thought. If you just replace scatter with plot, everything should work as is. If you set a few options correctly (ls, marker, and ms), calling plot will produce a scatter plot. Commented Apr 8, 2018 at 7:46

1 Answer

Simplest answer

Just use plot instead of scatter:

plt.plot(X_test, y_test, ls="none", marker='.', ms=12)

This plots each column of the x data against the same single set of y data. It assumes that x.shape == (n, d) and y.shape == (n,), as in your question above.
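As a rough check of the shape handling described above, here is a minimal sketch using synthetic data. X_demo and y_demo are hypothetical stand-ins for X_test and y_test, with the 5 x 7 layout that a 20% split of 25 rows would give. The Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 7                        # 5 test rows, 7 feature columns
X_demo = rng.normal(size=(n, d))   # hypothetical stand-in for X_test
y_demo = rng.normal(size=n)        # hypothetical stand-in for y_test

fig, ax = plt.subplots()
# plot pairs every column of X_demo with the single y_demo vector
lines = ax.plot(X_demo, y_demo, ls="none", marker=".", ms=12)

# one Line2D object comes back per feature column
print(len(lines))
```

plot returns a list with one line object per column, so the seven features end up as seven separate marker series that share the same y values.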

Simple answer

Loop over the columns of your x values, and call scatter once for each column:

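# one color per feature column (requires `import numpy as np`)
colors = plt.cm.viridis(np.linspace(0.0, 1.0, X_test.shape[1]))
for xcol, c in zip(X_test.T, colors):
    # color= (rather than c=) avoids ambiguity when passing a single RGBA value
    plt.scatter(xcol, y_test, color=c)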

Taking one entry from the colors array per column plots each feature in a different color on the scatter plot. If you do want them all to be black, drop the colors array and pass 'black' as the color instead.

Details

scatter expects one list of x values and one list of y values. It's simplest if the x and y lists are 1D. However, you can also plot multiple sets of x and y data stored in 2D arrays, as long as those arrays have matching shapes.

From the Matplotlib docs:

Fundamentally, scatter works with 1-D arrays; x, y, s, and c may be input as 2-D arrays, but within scatter they will be flattened.

That's a bit vague, but a dive into the Matplotlib source code confirms that x and y must contain the same number of elements: both are flattened and their sizes are compared. The code that handles shapes for plot is more flexible, so with that function you can get away with using one set of y data for many sets of x data.
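The size check can be seen directly with a small sketch on synthetic data. X_demo and y_demo are hypothetical stand-ins for X_test and y_test. The second call is an alternative that is not part of the answer above: y is repeated across columns so that it has the same shape as X, which lets a single scatter call succeed.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 7
X_demo = rng.normal(size=(n, d))   # hypothetical stand-in for X_test
y_demo = rng.normal(size=n)        # hypothetical stand-in for y_test

fig, ax = plt.subplots()

# Mismatched sizes: 35 x values against 5 y values.
try:
    ax.scatter(X_demo, y_demo)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("scatter raised:", err)

# Repeat y along a new column axis so it matches X's (n, d) shape.
y_matched = np.repeat(y_demo[:, np.newaxis], d, axis=1)
points = ax.scatter(X_demo, y_matched, color="black")

# scatter flattened both inputs, so there is one point per element of X
n_points = len(points.get_offsets())
print(n_points)
```

Because everything is flattened into one collection, this variant cannot easily give each feature its own color; the per-column loop is the better fit when that matters.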

Normally plot draws lines instead of dots, but you can turn the lines off by setting ls (i.e. linestyle) and turn dots on by setting marker. ms (i.e. markersize) controls the size of the dots.
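For reference, the same call can be written with the long keyword names. The sketch below uses synthetic data (X_demo and y_demo are hypothetical stand-ins for X_test and y_test). The "feature j" legend labels are an illustrative addition, not something from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption so this runs headless
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(2)
X_demo = rng.normal(size=(5, 7))   # hypothetical stand-in for X_test
y_demo = rng.normal(size=5)        # hypothetical stand-in for y_test

fig, ax = plt.subplots()
# linestyle / markersize are the long forms of ls / ms
lines = ax.plot(X_demo, y_demo, linestyle="none", marker=".", markersize=12)

# plot returns one Line2D per feature column, so each can be labeled
for j, line in enumerate(lines):
    line.set_label(f"feature {j}")
ax.legend()

first = lines[0]
print(first.get_linestyle(), first.get_marker(), first.get_markersize())
```

Since each feature is a separate line object, plot assigns the colors from the default color cycle automatically, and the legend shows which color belongs to which column.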

Example

The example you posted above won't run (X and y aren't defined), but here's a complete example with output:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm

from sklearn import datasets
from sklearn.model_selection import train_test_split

d = datasets.load_diabetes()
features = d.data.shape[1]

X = d.data[:50,:]
Y = d.target[:50]

sample_weight = np.random.RandomState(442).rand(Y.shape[0])

# split train, test for calibration
X_train, X_test, Y_train, Y_test, sw_train, sw_test = \
    train_test_split(X, Y, sample_weight, test_size=0.9, random_state=442)

# use the plot function instead of scatter
# plot one set of y data against several sets of x data
plt.plot(X_test, Y_test, ls="none", marker='.', ms=12)

# alternative: call .scatter() once per feature column in a loop
#colors = plt.cm.viridis(np.linspace(0.0, 1.0, features))
#for xcol, c in zip(X_test.T, colors):
#    plt.scatter(xcol, Y_test, color=c)

plt.show()

Output:

[Image: resulting plot]

1 Comment

The simple answer seems to work fine but for the simple answer, I can see that y.shape is still the same as y.T.shape, so it's not working.
