Forgive my terminology, I'm not an ML pro. I might use the wrong terms below.
I'm trying to perform multivariable linear regression. Let's say I'm trying to work out user gender by analysing page views on a web site.
For each user whose gender I know, I have a feature matrix where each row represents a web site section: the first element is the section ID and the second is whether the user visited it, e.g.:
male1 = [
    [1, 1],  # visited section 1
    [2, 0],  # didn't visit section 2
    [3, 1],  # visited section 3, etc.
    [4, 0],
]
So in scikit-learn, I am building xs and ys. I'm representing male as 1 and female as 0.
The above would be represented as:
features = male1
gender = 1
Now, obviously I'm not training a model on a single user; I have tens of thousands of users whose data I want to use for training.
I would have thought I should create my xs and ys as follows:
xs = [
    [  # user1
        [1, 1],
        [2, 0],
        [3, 1],
        [4, 0],
    ],
    [  # user2
        [1, 0],
        [2, 1],
        [3, 1],
        [4, 0],
    ],
    ...
]
ys = [1, 0, ...]
scikit doesn't like this:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(xs, ys)
It complains:
ValueError: Found array with dim 3. Estimator expected <= 2.
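In case it helps, here is a self-contained snippet that reproduces the error for me. The data is made up (just two users) and the numpy conversion is my own addition, but the structure is the same as above:

import numpy as np
from sklearn import linear_model

# Two users, each a 4x2 matrix of [section_id, visited] rows
xs = np.array([
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1, male
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2, female
])
ys = np.array([1, 0])

print(xs.shape)  # (2, 4, 2) -- three dimensions

clf = linear_model.LinearRegression()
clf.fit(xs, ys)  # raises the ValueError above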
How am I supposed to supply a feature matrix to the linear regression algorithm in scikit-learn?