Forgive my terminology, I'm not an ML pro. I might use the wrong terms below.

I'm trying to perform multivariable linear regression. Let's say I'm trying to work out user gender by analysing page views on a web site.

For each user whose gender I know, I have a feature matrix where each row represents a web site section: the first element is the section number, and the second element indicates whether the user visited it, e.g.:

male1 = [
    [1, 1],     # visited section 1
    [2, 0],     # didn't visit section 2
    [3, 1],     # visited section 3, etc
    [4, 0]
]

So in scikit, I am building xs and ys. I'm representing a male as 1, and female as 0.

The above would be represented as:

features = male1
gender = 1

Now, I'm obviously not just training a model for a single user, but instead I have tens of thousands of users whose data I'm using for training.

I would have thought I should create my xs and ys as follows:

xs = [
    [          # user1
       [1, 1],    
       [2, 0],     
       [3, 1],    
       [4, 0]
    ],
    [          # user2
       [1, 0],    
       [2, 1],     
       [3, 1],    
       [4, 0]
    ],
    ...
]

ys = [1, 0, ...]

scikit doesn't like this:

from sklearn import linear_model

clf = linear_model.LinearRegression()
clf.fit(xs, ys)

It complains:

ValueError: Found array with dim 3. Estimator expected <= 2.

How am I supposed to supply a feature matrix to the linear regression algorithm in scikit-learn?

1 Answer

You need to create xs in a different way. According to the docs:

fit(X, y, sample_weight=None)

Parameters:

    X : numpy array or sparse matrix of shape [n_samples, n_features]
        Training data
    y : numpy array of shape [n_samples, n_targets]
        Target values
    sample_weight : numpy array of shape [n_samples]
        Individual weights for each sample

Hence xs should be a 2D array with as many rows as users and as many columns as web site sections. You defined xs as a 3D array, though. The section numbers are redundant, because the position of a column already identifies the section, so you can drop them and reduce the number of dimensions by one with a list comprehension:

xs = [[visit for section, visit in user] for user in xs]

If you do so, the data you provided as an example gets transformed into:

xs = [[1, 0, 1, 0], # user1
      [0, 1, 1, 0], # user2
      ...
      ]

and clf.fit(xs, ys) should work as expected.
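
As a minimal sketch of the steps above, here is the whole flow using only the two example users from the question. The name xs_nested is just an illustrative label for the original 3D list, and a real training set would of course have many more rows:

```python
from sklearn import linear_model

# Nested [section, visited] pairs for the two example users from the question.
xs_nested = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
]
ys = [1, 0]  # 1 = male, 0 = female

# Keep only the "visited" flag, so each user becomes one row of features.
xs = [[visit for section, visit in user] for user in xs_nested]
print(xs)  # [[1, 0, 1, 0], [0, 1, 1, 0]]

# xs is now 2D (n_samples x n_features), which is what fit() expects.
clf = linear_model.LinearRegression()
clf.fit(xs, ys)
print(clf.coef_.shape)  # (4,) : one coefficient per web site section
```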

A more efficient way to drop the extra dimension is to slice a NumPy array:

import numpy as np
xs = np.asarray(xs)[:,:,1]
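
To see what the slice does, the sketch below checks the shapes before and after on the two example users from the question (xs_nested and xs_array are illustrative names). It assumes every user lists the same sections in the same order; otherwise the columns would not line up and the nested list might not convert to a regular 3D array at all:

```python
import numpy as np

# Nested [section, visited] pairs for the two example users from the question.
xs_nested = [
    [[1, 1], [2, 0], [3, 1], [4, 0]],  # user1
    [[1, 0], [2, 1], [3, 1], [4, 0]],  # user2
]

xs_array = np.asarray(xs_nested)
print(xs_array.shape)  # (2, 4, 2) : users x sections x [section, visited]

# Index 1 on the last axis selects the "visited" flag and drops that axis.
xs = xs_array[:, :, 1]
print(xs.shape)  # (2, 4) : users x sections
```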