
Not sure if the title makes complete sense, so sorry about that. I'm new to machine learning and I'm using Scikit and decision trees.

Here's what I want to do: I want to take all of my inputs and include one unique feature, which is a client ID. The client ID is unique and can't be handled the way an ordinary feature is in decision tree analysis. What's happening now is that the tree treats the client ID as any other integer value and branches on it, saying, for instance, that client IDs less than 430 go down a different path than those over 430. That isn't correct and isn't what I want. What I want is to make the decision tree understand that this particular field can't be analyzed that way, and that each client should get its own branch. Is this possible with decision trees?

I do have a couple of workarounds. One would be to develop a separate decision tree for each client, but training those would be a nightmare. Another would be, say we have 800 clients, to create 800 features as a bit field, but this is also crazy.

  • Yes, the second option you described (one-hot encoding) is what I would suggest for your description Commented Feb 21, 2017 at 17:09
  • This seems like a whole lot of work, though. What if I need to expand to thousands of clients? Is this the best way? Commented Feb 21, 2017 at 17:17
  • Because I'm using pandas, I'm guessing the get_dummies function is probably my best bet? (See the sketch after these comments.) Commented Feb 21, 2017 at 17:33
  • You've pretty well described your own solution: you need to use a tool that allows you to exclude the ID as an analysis feature. Commented Feb 21, 2017 at 17:47
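For reference, a minimal sketch of the one-hot / get_dummies approach discussed in the comments above, assuming a pandas DataFrame with a client_id column and a label column (all column names and values here are made up for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: client_id plus an ordinary feature and a label.
df = pd.DataFrame({
    "client_id": [101, 102, 103, 101, 102],
    "feature_a": [3.2, 1.5, 4.8, 2.9, 1.1],
    "label":     [0, 1, 0, 0, 1],
})

# One-hot encode the client ID so each client becomes its own 0/1 column.
# With hundreds or thousands of clients this produces that many columns,
# which is exactly the blow-up described in the question.
X = pd.get_dummies(df.drop(columns=["label"]), columns=["client_id"])
y = df["label"]

clf = DecisionTreeClassifier().fit(X, y)
```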

1 Answer


This is a fairly common problem in machine learning. A feature that is unique to every instance can't be useful in any case. Intuitively this makes sense: the algorithm can't learn anything from a feature it can't extrapolate from, since it will never see the same value again at prediction time.

What you can do is separate that piece of information out before you pass the rest of the features to the decision tree, and then re-merge the ID with the prediction after it is made.
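A minimal sketch of that drop-then-re-merge pattern with pandas and scikit-learn, assuming a DataFrame with a client_id column and a label column (the column names and values are just placeholders):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: the ID rides along with the features but is not one of them.
df = pd.DataFrame({
    "client_id": [101, 102, 103, 104],
    "feature_a": [3.2, 1.5, 4.8, 2.9],
    "feature_b": [0, 1, 1, 0],
    "label":     [0, 1, 0, 1],
})

ids = df["client_id"]                          # set the ID aside
X = df.drop(columns=["client_id", "label"])    # the tree never sees the ID
y = df["label"]

clf = DecisionTreeClassifier().fit(X, y)

# Re-attach the ID to the predictions afterwards.
results = pd.DataFrame({"client_id": ids, "prediction": clf.predict(X)})
```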

I would strongly discourage any kind of manipulation of the feature vector to include the ID in any form. Features should only be things you actually want the algorithm to use when making decisions; don't give it information you don't want it to use. You're right to want to avoid using an ID as a feature, because (most likely) the ID has no bearing on whatever you're trying to predict.

If you do want individual models (and have enough data for each user to build them), it's not as big a pain as you might be thinking. You can use Scikit's model saving feature and this answer on saving pickles to MySQL to easily create and store personalized models. Unless you have a very large number of users, creating personalized decision trees shouldn't take very long.
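If you do go the per-client route, here is a rough sketch of training one tree per client and persisting each with joblib (which the scikit-learn docs suggest for model persistence). The column names, client IDs, and file naming are assumptions; storing the pickled bytes in MySQL would be a separate step.

```python
import pandas as pd
import joblib
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: several rows per client.
df = pd.DataFrame({
    "client_id": [101, 101, 101, 102, 102, 102],
    "feature_a": [3.2, 1.5, 4.8, 2.9, 1.1, 0.7],
    "label":     [0, 1, 0, 0, 1, 1],
})

models = {}
for client_id, group in df.groupby("client_id"):
    X = group.drop(columns=["client_id", "label"])
    y = group["label"]
    clf = DecisionTreeClassifier().fit(X, y)
    models[client_id] = clf
    # Persist each model to disk (or pickle into a database blob instead).
    joblib.dump(clf, f"tree_client_{client_id}.joblib")

# Later: load a specific client's model and use it for predictions.
clf_101 = joblib.load("tree_client_101.joblib")
```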


2 Comments

Well, here's the thing: this is obviously supervised learning, and it's possible (though I'm not sure how likely) that each client could have slightly different outcomes. It's something I would need to test, because if something like one-hot encoding gives bad outcomes for a client, I would just throw it away. However, based on your comment above, I'll do it the other way around and not use one-hot. I'll do some testing to gauge accuracy, and if it doesn't work, I'll explore the other options.
One-hot encoding is a great idea, but user IDs are rarely good categorical features in any capacity. If you signed up for StackOverflow two weeks before (or after) I did, does that lend any insight into making predictions about us as users? The answer is almost always "no".
