• Each sample can belong to one or multiple classes at the same time.
  • Because a sample x can have more than one correct label, apply a sigmoid to each class's output independently rather than a softmax over all classes.
    • Sigmoid gives each label its own probability between 0 and 1, independent of the other labels, so several labels can score high at once. Softmax forces the probabilities to sum to 1, so the labels compete and at most one can exceed 0.5.
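
The difference between the two activations can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the logits are made-up values for three hypothetical classes, the 0.5 decision threshold is a common default rather than something these notes specify, and the helper names (`sigmoid`, `softmax`) are chosen here for clarity.

```python
import math

def sigmoid(z):
    """Map one logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    """Map all logits to probabilities that sum to 1 (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes; the first two are both "likely".
logits = [2.0, 1.5, -1.0]

# Multi-label: each class is scored on its own, so several can pass the threshold.
sigmoid_probs = [sigmoid(z) for z in logits]

# Single-label: classes compete for a shared probability mass of 1.
softmax_probs = softmax(logits)

# Assumed decision threshold of 0.5 for the multi-label prediction.
threshold = 0.5
predicted = [i for i, p in enumerate(sigmoid_probs) if p >= threshold]

print("sigmoid:", [round(p, 3) for p in sigmoid_probs])
print("softmax:", [round(p, 3) for p in softmax_probs])
print("predicted labels (sigmoid):", predicted)
```

With these logits, the sigmoid outputs put classes 0 and 1 both above the threshold, while the softmax outputs can lift only one class above 0.5 because they must sum to 1.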