Link: https://www.kaggle.com/c/LANL-Earthquake-Prediction

Problem Type: regression, signal processing

Input: the acoustic data (the shaky line that is drawn by seismometers)

Output: the time (in seconds) until the next laboratory earthquake

Eval Metric: Mean absolute error (MAE)

Summary

  • This was a competition with a high leaderboard shakeup because ppl overfit the public LB
  • Your goal is to predict the time remaining before laboratory earthquakes occur from real-time seismic data.
  • These aren’t real earthquakes; it’s a lab setup.
  • Why did many ppl overfit the public LB?
    • they blindly subtracted the mean (of the acoustic data) from the signal during training, but the mean shifts in the test set
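One possible reading of the mean-shift problem, sketched on synthetic data: if the signal is centered with a mean learned on train, a test segment whose mean has drifted keeps a residual offset, while centering each segment by its own mean removes it. The segment sizes and the offsets 4.5 and 5.0 are made-up values for illustration, not numbers from the competition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for acoustic segments (offsets are illustrative assumptions).
train_seg = rng.normal(loc=4.5, scale=3.0, size=150_000)
test_seg = rng.normal(loc=5.0, scale=3.0, size=150_000)

train_mean = train_seg.mean()

# Centering test data with the *training* mean leaves the drift in place.
test_centered_with_train_mean = test_seg - train_mean

# Centering each segment with its own mean removes the drift.
test_centered_per_segment = test_seg - test_seg.mean()

residual_offset = test_centered_with_train_mean.mean()
per_segment_offset = test_centered_per_segment.mean()

print(f"offset left by train-mean centering: {residual_offset:.3f}")
print(f"offset left by per-segment centering: {per_segment_offset:.3e}")
```

Any feature built on the magnitude of the centered signal (for example a mean of absolute values) would inherit that residual offset in the first case.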

Important notebooks/discussions

Solutions

  • (1st) PSI’s solution: add noise to denoise the median statistic

    • https://www.kaggle.com/competitions/LANL-Earthquake-Prediction/discussion/94390
    • solution code:
    • “After doing abovementioned signal manipulation, we had more trust in our calculated features and could focus on better studying differences between train and test data feature distributions”
    • a simple shuffled 3-kfold
    • adversarial validation shows that the signal had a certain time-trend that made mean or quantile features unreliable
      • e.g. mean number of “spikes” in a sliding window of 100ms of data
    • One of our best final LGB models used only four features:
    • Those 4 are decently uncorrelated between themselves, and add good diversity.
    • We only considered a feature if it had a p-value > 0.05 on a KS test of train vs. test, i.e. the test could not tell its train and test distributions apart.
    • why does adding noise and then subtracting the median de-noise the signal? (“Add noise to denoise median statistic”)
      • https://www.kaggle.com/competitions/LANL-Earthquake-Prediction/discussion/94390#553415
      • If you take the mean of train and test, you can see that the test set has a higher mean than the training set. Therefore we elected to subtract the mean to make test more similar to train. However, even after subtracting the mean, features related to the median (e.g. np.median(z)) still differed between train and test segments. To solve this, we tried subtracting only the median (instead of the mean), but this still did not solve the issue. Therefore we injected a small amount of noise, and then subtracted the median.
        • After adding the “noise” (std 0.5), which is actually lower or in the range of the expected accuracy of the sensor (1), absmedian behaves very normally (same distribution for train and test)
          • where z = z - np.mean(z); absmedian = np.median(abs(z))
          • I think because after noise was injected, the median they subtract by is the TRUE median (much less disturbed by random points)
          • “with the raw data, the median for the earthquakes were: 3, 4, 4.5, or 5” - by CPMP
            • Once noise is added then median value distribution is not as sparse.
      • why did they want to subtract the median rather than the mean?
        • because the median is more robust to outliers than the mean
      • Our features are then calculated on this manipulated signal.
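A minimal sketch of the noise trick described above, on synthetic data. It assumes the raw signal is integer-valued, so per-segment medians can only land on a few integer or half-integer values; adding Gaussian noise with std 0.5 (the value quoted above) makes the median vary continuously. The helper name `add_noise_and_center`, the segment length, and the synthetic means are illustrative assumptions, not the winners' code.

```python
import numpy as np

rng = np.random.default_rng(0)


def add_noise_and_center(signal, rng, noise_std=0.5):
    """Add small Gaussian noise, then subtract the median of the noisy signal."""
    noisy = signal + rng.normal(0.0, noise_std, size=signal.shape)
    return noisy - np.median(noisy)


# Synthetic integer-valued segments whose true mean drifts slightly (made-up values).
segment_means = rng.uniform(4.2, 4.8, size=50)
segments = [np.round(rng.normal(mu, 3.0, size=4096)) for mu in segment_means]

# Medians of the raw integer signal are sparse (integers or half-integers).
raw_medians = np.array([np.median(s) for s in segments])

# Medians after noise injection vary smoothly with the underlying drift.
noisy_medians = np.array(
    [np.median(s + rng.normal(0.0, 0.5, size=s.shape)) for s in segments]
)

n_raw_values = len(np.unique(raw_medians))
n_noisy_values = len(np.unique(noisy_medians))
print("distinct raw medians:", n_raw_values)
print("distinct noisy medians:", n_noisy_values)

# Features would then be computed on the manipulated signal.
centered = add_noise_and_center(segments[0], rng)
```

With the raw signal, subtracting the median can only shift a segment by one of a few coarse values; after noise injection the subtracted median tracks the actual offset much more finely.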
    • KS statistic showed that there’s a drift between the train and test data
      • so we decided to sample the train data to make it look more like we expect test data to look like (only from looking at feature distributions)
      • we calculated a handful of features for train and test and tried to find a good subset of full earth-quakes in train, so that the overall feature distributions are similar to those of the full test data.
      • We did this by sampling 10 full earthquakes multiple times (up to 10k times) on train, and comparing the average KS statistic of all selected features on the sampled earthquakes to the feature dists in full test.
        • “The x-axis is the average target of the selected EQs in train and the y-axis is the KS statistic on a bunch of features comparing the distribution of that feature for the selected EQs vs the full test data. We can see that the best average KS-statistic is somewhere in the range of 6.2-6.5. You can also see nicely here that a problematic feature like the green one deviates clearly from the rest, this would be a feature we would not select in the end.”
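The sampling procedure above can be sketched roughly as follows, assuming a feature matrix per train segment, an earthquake id per segment, and a feature matrix for test. The function names, the number of trials, and the synthetic data (16 earthquakes, 3 features) are hypothetical; only the idea of picking 10 full earthquakes and scoring them by the average KS statistic comes from the write-up.

```python
import numpy as np
from scipy.stats import ks_2samp


def mean_ks(train_feats, test_feats):
    """Average two-sample KS statistic over all feature columns."""
    return float(np.mean([
        ks_2samp(train_feats[:, j], test_feats[:, j]).statistic
        for j in range(train_feats.shape[1])
    ]))


def sample_earthquake_subset(train_feats, eq_ids, test_feats,
                             n_pick=10, n_trials=200, seed=0):
    """Randomly pick full earthquakes; keep the subset closest to test."""
    rng = np.random.default_rng(seed)
    unique_eqs = np.unique(eq_ids)
    best_score, best_subset = np.inf, None
    for _ in range(n_trials):
        subset = rng.choice(unique_eqs, size=n_pick, replace=False)
        mask = np.isin(eq_ids, subset)  # all segments of the chosen earthquakes
        score = mean_ks(train_feats[mask], test_feats)
        if score < best_score:
            best_score, best_subset = score, np.sort(subset)
    return best_subset, best_score


# Synthetic demo: each earthquake gets its own feature offset (made-up data).
rng = np.random.default_rng(1)
eq_ids = np.repeat(np.arange(16), 100)
eq_shift = rng.normal(0.0, 0.5, size=16)
train_feats = rng.normal(size=(1600, 3)) + eq_shift[eq_ids, None]
test_feats = rng.normal(0.3, 1.0, size=(800, 3))

subset, score = sample_earthquake_subset(train_feats, eq_ids, test_feats)
print("selected earthquakes:", subset, "mean KS:", round(score, 4))
```

Sampling whole earthquakes rather than individual segments keeps each selected cycle intact, which matches the "full earthquakes" wording in the notes.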
    • time since an event occurred as an auxiliary target: “We have one additional binary logloss with the target specifying if the time to failure (ttf) is <0.5 and one further MAE loss on the target of time-since-failure.”
    • Neural nets trained on raw data failed
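For the KS-based feature filter mentioned in the 1st-place notes, a minimal sketch could look like this. It assumes features are stored as name-to-array dictionaries; `ks_filter` and the two toy features are hypothetical, and the deterministic `linspace` data is only there to make the behaviour easy to follow.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_filter(train_features, test_features, alpha=0.05):
    """Keep features whose train/test KS p-value exceeds alpha."""
    kept = []
    for name, train_values in train_features.items():
        p_value = ks_2samp(train_values, test_features[name]).pvalue
        if p_value > alpha:  # no evidence of a train/test distribution shift
            kept.append(name)
    return kept


# Toy features: one with matching distributions, one shifted in test.
train_features = {
    "stable": np.linspace(0.0, 1.0, 500),
    "drifting": np.linspace(0.0, 1.0, 500),
}
test_features = {
    "stable": np.linspace(0.0, 1.0, 400),
    "drifting": np.linspace(0.5, 1.5, 400),
}

kept = ks_filter(train_features, test_features)
print(kept)
```

A high p-value only means the test failed to detect a difference, so this is a screening step rather than proof that a feature is safe.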
  • (2nd) lots of signal processing features. made train data look like test data

  • (3rd) made train data look like test data; LSTM

    • https://www.kaggle.com/competitions/LANL-Earthquake-Prediction/discussion/94459
    • As already mentioned in the discussion, we expected the test data to come from p4677.
    • the yhat is found via (NOTE: I don’t completely understand this):
      • best_y = np.median(np.hstack([np.repeat(cycle_len, int(100*cycle_len)) for cycle_len in chunk_length_list]))
        • the code creates a long array by repeating each chunk length value proportionally to the length itself.
          • Then it calculates the median of that array.
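          • A likely rationale (my guess): the weights depend on each chunk’s own length, not on how often a length occurs. Longer cycles cover more of the timeline, so a randomly placed segment is more likely to fall inside a long cycle, and the length-weighted median reflects that.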
          1. cycle_len in chunk_length_list: This iterates over each cycle_len in a given list of chunk lengths, referred to as chunk_length_list.
          2. np.repeat(cycle_len, int(100*cycle_len)): For each cycle_len from the chunk_length_list, it repeats cycle_len int(100*cycle_len) times. For example, if cycle_len is 2, it creates an array [2, 2, 2,...] with length int(100*2) = 200.
          3. np.hstack([]): This function is used for horizontally stacking all the arrays created for each cycle_len in chunk_length_list and making one single flattened array.
          4. np.median(): Finally, the median of the created array is calculated and is stored in best_y.
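To see what the one-liner does, here is a small runnable version with made-up chunk lengths (the values 1.0, 2.0 and 4.0 are assumptions for illustration, not the competition's cycle lengths). It compares the length-weighted median with the plain median.

```python
import numpy as np

# Hypothetical cycle lengths in seconds.
chunk_length_list = [1.0, 2.0, 4.0]

# Each length is repeated in proportion to itself: 100, 200 and 400 copies.
repeated = np.hstack([
    np.repeat(cycle_len, int(100 * cycle_len))
    for cycle_len in chunk_length_list
])

best_y = np.median(repeated)             # length-weighted median
plain_median = np.median(chunk_length_list)

print(len(repeated))   # 700
print(best_y)          # 4.0
print(plain_median)    # 2.0
```

Of the 700 repeated values, 400 are the longest cycle, so it alone holds more than half the weight and becomes the median, whereas the plain median picks the middle length.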
  • (7th) Used Signal processing techniques to generate features

Takeaways