Link: https://www.kaggle.com/c/tweet-sentiment-extraction/overview

Problem Type: substring segmentation

Input: A tweet

  • each tweet is classified as either neutral, positive, or negative
    • I assume a human labels the sentiment

Output: the substring (selected_text column in train.csv) that determines the sentiment of the tweet

Eval Metric: word-level Jaccard similarity (aka Intersection over Union) between your output and the selected_text column
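A minimal sketch of the metric, to make the "intersection over union" part concrete. It assumes lowercasing and whitespace tokenization, and treating two empty strings as a perfect match is my own choice, so check the competition's evaluation page before relying on those details:

```python
def jaccard(pred: str, target: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    a = set(pred.lower().split())
    b = set(target.lower().split())
    if not a and not b:
        # Assumption: two empty strings count as a perfect match.
        return 1.0
    overlap = a & b
    # |A ∩ B| / |A ∪ B|, with the union written as |A| + |B| - |A ∩ B|
    return len(overlap) / (len(a) + len(b) - len(overlap))


print(jaccard("miss every one", "gonna miss every one"))  # 0.75
```

Because the score is computed on sets of words, a prediction that is off by a single word in a short selected_text loses a large share of the score.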

Summary

Data quality issues:

  • https://www.kaggle.com/c/tweet-sentiment-extraction/discussion/159254
  • when humans tweet, they might add extra spaces:
    • “is back home now gonna miss every one”
  • before the tweets were given to the annotators, consecutive spaces were removed from the original tweet
  • the annotator's selection is then mapped back onto the original text, so the dataset’s target labels are WRONG (shifted) whenever there were consecutive spaces before the labelled text
  • Note: this isn’t the only source of data quality issues. More cleaning was required
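A small sketch of how I understand the misalignment. The raw tweet and both helper names (collapse_spaces, misaligned_label) are hypothetical, and the assumption is that the annotator's character offsets on the cleaned tweet were applied unchanged to the raw tweet:

```python
def collapse_spaces(text: str) -> str:
    """Replace every run of whitespace with one space and strip the ends."""
    return " ".join(text.split())


def misaligned_label(raw: str, selected: str) -> str:
    """Find `selected` in the cleaned tweet, then slice the RAW tweet with
    the same character offsets (the assumed source of the bad labels)."""
    clean = collapse_spaces(raw)
    start = clean.index(selected)
    end = start + len(selected)
    return raw[start:end]


# Hypothetical raw tweet: two spaces after "back".
raw_tweet = "is back  home now gonna miss every one"
print(repr(misaligned_label(raw_tweet, "miss every one")))  # ' miss every on'
```

Each extra space before the selection shifts the label one character to the left, which would explain labels that start with stray characters and end mid-word.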

Important notebooks/discussions

Solutions

Takeaways

  • nobody used a regression strategy to predict the start/end indices
    • i.e. they don’t “use BERT to output 2 integers from 1 to n representing the start and end idx of the substring”
    • instead, for each token in the input, the model predicts the probability that the token is in the selected substring
      • so if there are n tokens in the input, the output has dimension n
  • you can take the logits of the final layer and post-process them… TODO fully understand
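One way the per-token output could be turned back into a substring, as a sketch only. The sigmoid, the 0.5 threshold, the "first to last selected token" rule, and the whole-tweet fallback are all my assumptions, not something taken from a specific solution. Whitespace-separated words stand in for real subword tokens, and the logits are made up:

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def extract_span(tokens: List[str], logits: List[float], threshold: float = 0.5) -> str:
    """Turn one logit per token into a contiguous substring."""
    probs = [sigmoid(z) for z in logits]
    selected = [i for i, p in enumerate(probs) if p >= threshold]
    if not selected:
        # Assumed fallback: no token passes the threshold, return the whole tweet.
        return " ".join(tokens)
    # Keep everything from the first to the last selected token, so the
    # result is always one contiguous span even if the middle has gaps.
    start, end = selected[0], selected[-1]
    return " ".join(tokens[start:end + 1])


tokens = "is back home now gonna miss every one".split()
logits = [-3.0, -2.5, -2.0, -1.0, 0.2, 2.1, 1.7, 2.4]  # made-up model output
print(extract_span(tokens, logits))  # gonna miss every one
```

With a real tokenizer the selected token indices would have to be mapped back to character offsets in the tweet rather than joined with spaces.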