What is the output of a semantic segmentation network?
UNet (the one in the example) and essentially every other semantic segmentation network produce as output an image whose spatial size matches (or is proportional to) that of the input, in which each pixel is classified as one of the specified classes.
For binary classification, the raw output is typically a single-channel float image with values in [0, 1] (e.g., after a sigmoid activation) that must be thresholded at 0.5 to obtain the "foreground" binary mask. It's also possible that the network is trained with two explicit classes (foreground/background); in that case, handle the output as described below for multi-class classification.
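A minimal sketch of the binary case, assuming a hypothetical single-channel output map in [0, 1] (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical raw network output: a single-channel float map in [0, 1]
# (e.g., after a sigmoid activation). Shape: (height, width).
raw = np.array([[0.1, 0.7],
                [0.4, 0.9]])

# Threshold at 0.5 to obtain the binary foreground mask.
mask = (raw > 0.5).astype(np.uint8)
print(mask)
# [[0 1]
#  [0 1]]
```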
For multi-class classification, the raw output image has N channels, one per class, with the value at index [x, y, c] being the score of class c for pixel (x, y) (think of it as the probability that the pixel belongs to class c, although in principle scores don't have to be probabilities). For each pixel, the selected class is the one whose channel has the highest score, i.e., an argmax over channels.
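A sketch of the multi-class case with NumPy, assuming a hypothetical (height, width, num_classes) score tensor (the values are made up):

```python
import numpy as np

# Hypothetical raw output of shape (height, width, num_classes):
# scores[x, y, c] is the score of class c for pixel (x, y).
scores = np.array([[[0.1, 0.8, 0.1],
                    [0.6, 0.3, 0.1]],
                   [[0.2, 0.2, 0.6],
                    [0.3, 0.3, 0.4]]])

# For each pixel, select the channel (class) with the highest score.
labels = np.argmax(scores, axis=-1)
print(labels)
# [[1 0]
#  [2 2]]
```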
Images can then be postprocessed (e.g., flattening them and assigning to each pixel the label of the "winning" class), as seems to be the case in the example you link (if you take a look at the implementation of labelVisualize(), they use a dict mapping class codes to colors).
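A sketch of that kind of postprocessing, with a dict mapping class codes to colors similar in spirit to the one used by labelVisualize() (the codes and colors here are made up, not the ones from the example):

```python
import numpy as np

# Hypothetical class-code -> RGB color mapping.
COLOR_DICT = {
    0: (0, 0, 0),      # background -> black
    1: (255, 0, 0),    # class 1 -> red
    2: (0, 255, 0),    # class 2 -> green
}

def colorize(labels):
    """Map an (H, W) array of class indices to an (H, W, 3) RGB image."""
    out = np.zeros(labels.shape + (3,), dtype=np.uint8)
    for code, color in COLOR_DICT.items():
        out[labels == code] = color
    return out

labels = np.array([[0, 1],
                   [2, 1]])
rgb = colorize(labels)
print(rgb.shape)
# (2, 2, 3)
```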