
I'm trying to write a neural network for binary classification in PyTorch, and I'm confused about the loss function.

I see that BCELoss is a common loss function specifically geared toward binary classification. I also see that an output layer of N outputs for N possible classes is standard for general classification. However, for binary classification it seems like the network could have either 1 or 2 outputs.

So, should I have 2 outputs (1 for each label) and then convert my 0/1 training labels into [1,0] and [0,1] arrays, or use something like a sigmoid for a single-variable output?

Here are the relevant snippets of code so you can see what I mean:

import torch.nn as nn
import torch.nn.functional as F

self.outputs = nn.Linear(NETWORK_WIDTH, 2)  # 1 or 2 dimensions?


def forward(self, x):
    # other layers omitted
    x = self.outputs(x)
    return F.log_softmax(x, dim=1)  # <<< softmax over multiple vars, sigmoid over one, or other?

criterion = nn.BCELoss()  # <<< Is this the right function?

net_out = net(data)
loss = criterion(net_out, target)  # <<< Should target be an integer label or 1-hot vector?

Thanks in advance.

2 Answers


For binary outputs you can use 1 output unit:

self.outputs = nn.Linear(NETWORK_WIDTH, 1)

Then you use a sigmoid activation to map the values of your output unit to a range between 0 and 1 (of course, your training labels need to be 0/1 floats to match):

def forward(self, x):
    # other layers omitted
    x = self.outputs(x)
    return torch.sigmoid(x)  # map the raw output to a probability in (0, 1)

Finally, you can use torch.nn.BCELoss:

criterion = nn.BCELoss()

net_out = net(data)
loss = criterion(net_out, target)  # target: float tensor of 0s and 1s, same shape as net_out

This should work fine for you.

You can also use torch.nn.BCEWithLogitsLoss. This loss function already includes the sigmoid, so you can leave it out of your forward; it is also more numerically stable than applying sigmoid and BCELoss separately.
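
For example, a minimal sketch of that variant, reusing the names net, data, and target from above (an assumed context, not a complete program):

def forward(self, x):
    # other layers omitted
    return self.outputs(x)  # return raw logits; no sigmoid here

criterion = nn.BCEWithLogitsLoss()  # applies sigmoid internally

net_out = net(data)                # raw logits
loss = criterion(net_out, target)  # target: float tensor of 0s and 1s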

If you want to use 2 output units, this is also possible. But then you need to use torch.nn.CrossEntropyLoss instead of BCELoss. The softmax activation is already included in this loss function.
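
A minimal sketch of that variant, again reusing the names from above; note that the targets are now integer class indices, not one-hot vectors:

self.outputs = nn.Linear(NETWORK_WIDTH, 2)

def forward(self, x):
    # other layers omitted
    return self.outputs(x)  # raw logits for both classes; no softmax here

criterion = nn.CrossEntropyLoss()  # applies log-softmax internally

net_out = net(data)                # shape: (batch_size, 2)
loss = criterion(net_out, target)  # target: long tensor of class indices (0 or 1)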


Edit: I just want to emphasize that there is a real difference between these two approaches. Using 2 output units gives you twice as many weights in the final layer compared to using 1 output unit, so the two alternatives are not equivalent.
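
You can check the parameter counts directly; NETWORK_WIDTH = 128 below is just an assumed value for illustration:

import torch.nn as nn

NETWORK_WIDTH = 128  # assumed value for illustration

one_unit = nn.Linear(NETWORK_WIDTH, 1)   # 128 weights + 1 bias
two_units = nn.Linear(NETWORK_WIDTH, 2)  # 256 weights + 2 biases

print(sum(p.numel() for p in one_unit.parameters()))   # 129
print(sum(p.numel() for p in two_units.parameters()))  # 258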


2 Comments

Thanks for great answer! The last question about 1 and 2 output units. When would I want to use one over another?
@никта Good question; actually I'm not sure there is a preferred strategy here. I think it is mostly a matter of taste. The main difference is not the number of units but the loss function and its activation, softmax vs. sigmoid, so you could check which activation suits your problem better. In most cases you probably won't notice a major difference in the capability of the resulting network. Since it is a minor change, I suggest testing both and checking which works best for your problem :)

Some theoretical background:

For binary classification (say class 0 and class 1), the network needs only 1 output unit. Its output should be 1 when class 1 is present (class 0 absent) and 0 when class 1 is absent (class 0 present).

For the loss calculation, you should first pass the output through a sigmoid and then through binary cross-entropy (BCE). The sigmoid transforms the output of the network into a probability (between 0 and 1), and minimizing BCE then maximizes the likelihood of the desired output.
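
A small self-contained sketch of that pipeline (random tensors stand in for real network outputs and labels), which also shows that it matches BCEWithLogitsLoss applied directly to the raw outputs:

import torch
import torch.nn as nn

logits = torch.randn(4, 1)                    # raw network outputs
target = torch.randint(0, 2, (4, 1)).float()  # 0/1 labels as floats

probs = torch.sigmoid(logits)                 # map outputs to probabilities in (0, 1)
loss = nn.BCELoss()(probs, target)            # binary cross-entropy on probabilities

# equivalent, and more numerically stable:
loss_logits = nn.BCEWithLogitsLoss()(logits, target)
print(torch.allclose(loss, loss_logits))      # True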

