Link: https://www.kaggle.com/c/stanford-ribonanza-rna-folding
Problem Type: per-position regression (sequence-to-sequence)
Input: RNA sequences (strings)
 you also have other features, like:
 signal_to_noise (float): signal/noise value for the profile, defined as mean(measurement value over probed positions) / mean(statistical error in measurement value over probed positions)
 reactivity_error columns, etc.
Output: the predicted reactivity for reactivity_DMS_MaP and reactivity_2A3_MaP at each sequence position id
 https://www.kaggle.com/code/dschettler8845/srrnareactivitylearnedabaseline#data_exploration
 i.e. if the sum of the lengths of all 1,343,824 sequences in the test set is 269,796,671, then you should make 2 × 269,796,671 = 539,593,342 predictions.
Eval Metric: MAE (mean absolute error)
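 A minimal sketch of the metric, assuming (per the competition description) that values are clipped to [0, 1] and unmeasured (NaN) positions are ignored:

```python
import numpy as np

def competition_mae(pred, target):
    """MAE as (assumed) scored here: clip to [0, 1], skip NaN targets."""
    pred = np.clip(pred, 0.0, 1.0)
    target = np.clip(target, 0.0, 1.0)
    mask = ~np.isnan(target)
    return np.abs(pred[mask] - target[mask]).mean()
```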
Summary
 Ribonanza_bpp_files: TXT files listing position pairs predicted to have nonzero Watson-Crick base-pair probabilities by the LinearPartition-EternaFold package. Files are given for train and test sequences, indexed by sequence_id.
 Note: this package simulates RNA secondary-structure ensembles without pseudoknots or other tertiary-structure features. It is also limited to Watson-Crick base pairs, even though other kinds of RNA interactions are known to form.
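 A sketch for loading one of these files into a dense matrix, assuming each line has the format `i j probability` with 1-based positions (verify against an actual file):

```python
import numpy as np

def load_bpp(path, seq_len):
    """Parse one Ribonanza_bpp_files txt into a dense symmetric (L, L) matrix.
    Assumed line format: 'i j prob', 1-based positions; unlisted pairs stay 0."""
    mat = np.zeros((seq_len, seq_len), dtype=np.float32)
    with open(path) as f:
        for line in f:
            i, j, p = line.split()
            mat[int(i) - 1, int(j) - 1] = float(p)
            mat[int(j) - 1, int(i) - 1] = float(p)
    return mat
```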
Important notebooks/discussions
 How to check if your model generalizes to long sequences https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/444653
 x-axis represents position, y-axis represents sequence number
 Since the test set contains longer sequences than the train set, the model doesn't "know" how to deal with unseen-before positions and does not generalize, i.e., the generated pictures are not similar to the ones posted by @shujun717 above. As for other embeddings, I advise you to experiment and find out what works best.
 this picture is also a prediction, not ground truth, so it is possible that your model will do better than theirs
 so there's no point in trying to get your model to output a similar image
 sliding window:
 the idea: you train a transformer on sequences of 300 tokens.
 How do you get predictions for sequences of 500 tokens?
 Simple: you just run your 300-token transformer multiple times and average the predictions.
 you slide your 300-token window until all 500 tokens are covered by at least one prediction (a sketch follows this list)
 Why they thought it was good for generalization:
 because the training data was at most 206 tokens long, but the test data is at most 457 tokens long
 so to generalize their models (trained on the shorter training data), they could apply the sliding window to the longer test examples.
 However, it didn't work, since it is "wrong in a biological sense" (1st place write-up)
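 A minimal sketch of this sliding-window inference, assuming a hypothetical `model` that returns one prediction per input position; the window/stride values are illustrative:

```python
import torch

@torch.no_grad()
def sliding_window_predict(model, seq, window=300, stride=100):
    """Average per-position predictions of a fixed-window model over a longer sequence.
    Assumes model(tokens) -> (batch, positions) predictions (one channel for simplicity)."""
    L = seq.shape[0]
    preds = torch.zeros(L)
    counts = torch.zeros(L)
    start = 0
    while True:
        end = min(start + window, L)
        preds[start:end] += model(seq[start:end].unsqueeze(0)).squeeze(0)
        counts[start:end] += 1
        if end == L:
            break
        start += stride
    return preds / counts  # each position averaged over every window that saw it
```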
 comments
 About the sliding window, I'm not sure what the benefit of using it with such short sequences is. The main idea behind a sliding window is to reduce the complexity from O(n^2) to O(n * w), where w is smaller than n, but the RNA sequences in this competition are ≤ 457 tokens, so full attention can be computed without much trouble.

(1st place): "sliding window seems to be useless"

It can be useful in this problem, since in the train set Lmax = 206 while in the test set Lmax = 457
My guess was that if I used sliding-window attention, my generalization plots would become closer to the ones Shujun shared, but it was not the case; they are still "too sharp"

 Decent EDA:
 https://www.kaggle.com/code/ayushs9020/understandingthecompetitionstandfordribonaza
Solutions

(1st) Transformer model with Dynamic positional encoding + CNN for BPPM features
 Squeeze-and-Excitation layer (a sketch of the standard block follows)
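 Not their exact code, but a standard Squeeze-and-Excitation block adapted to 1D sequence features, to show what the layer does:

```python
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Standard SE block for (batch, channels, length) features:
    squeeze over length, then learn a per-channel reweighting."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                # x: (B, C, L)
        scale = self.fc(x.mean(dim=-1))  # squeeze: global average over length
        return x * scale.unsqueeze(-1)   # excite: rescale each channel
```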
 To allow better generalization for longer inputs we implemented Dynamic Positional Bias (a sketch of the idea follows)
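 A hedged sketch of dynamic positional bias as popularized in lucidrains' x-transformers (not necessarily the team's exact code): an MLP maps each relative distance to a per-head attention bias, so it can be evaluated at distances never seen in training; the hidden size here is arbitrary:

```python
import torch
import torch.nn as nn

class DynamicPositionalBias(nn.Module):
    """MLP from signed relative distance to one attention-bias value per head.
    Unlike a learned embedding table, the MLP extrapolates to unseen distances."""
    def __init__(self, heads, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, heads),
        )

    def forward(self, seq_len):
        pos = torch.arange(seq_len, dtype=torch.float32)
        rel = pos[None, :] - pos[:, None]       # (L, L) signed distances
        bias = self.mlp(rel.unsqueeze(-1))      # (L, L, heads)
        return bias.permute(2, 0, 1)            # (heads, L, L): add to attention logits
```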
 use DBSCAN
 Subsetting the data (filtering by different thresholds on the signal-to-noise ratio) resulted in a performance boost for all models. However, this technique was superseded by weight sampling, which proved to be more effective.
 by subsetting data, I think they mean: they only kept training rows whose signal-to-noise ratio cleared a threshold
 How did they calculate the signal-to-noise ratio? (presumably per the data description's formula; see the sketch below)
 by weight sampling, I think they mean: they weighted each training row by how high its signal-to-noise ratio is
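 A sketch under those assumptions: SN computed as the data description defines it, plus a hypothetical weighting scheme that samples high-SN rows more often instead of hard-filtering:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def signal_to_noise(reactivity, reactivity_error):
    """Per the data description: mean(reactivity) / mean(error) over probed positions."""
    mask = ~np.isnan(reactivity)
    return reactivity[mask].mean() / reactivity_error[mask].mean()

def make_sampler(reactivities, errors):
    """Hypothetical weighting: sample high-SN rows more often,
    rather than dropping rows below an SN threshold ('subsetting')."""
    sn = np.array([signal_to_noise(r, e) for r, e in zip(reactivities, errors)])
    weights = np.clip(sn, 0.0, 10.0)  # cap so a few very clean rows don't dominate
    return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                 num_samples=len(weights), replacement=True)
```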
 We tried to use data about predicted 3D structure of 100k sequences from the train dataset but gave up on that once we had visually analyzed them:
 probably because they all looked the same in 3D?
 Using absolute positional embedding leads to unsolvable issues when generalizing upon longer sequences.
 How to solve the positional embedding to generalize to longer sequences
 absolute positional embedding: it doesn't generalize to longer sequences!
 Try shifting the positional embeddings to the right
 e.g. if our sequence ends 40 tokens below the context length, we can shift the positional embeddings up to 40 indexes to the right
 so the first token gets, say, the positional embedding at index 30 (a sketch follows)
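 A minimal sketch of this shift augmentation on a learned absolute embedding table (assuming max_len is at least the longest expected sequence):

```python
import torch
import torch.nn as nn

class ShiftedAbsolutePE(nn.Module):
    """Absolute positional embedding with random shift augmentation:
    a length-L sequence uses table rows [s, s+L) for a random s, so the
    rows beyond the typical training length still receive gradient."""
    def __init__(self, max_len, dim):
        super().__init__()
        self.emb = nn.Embedding(max_len, dim)
        self.max_len = max_len

    def forward(self, x):  # x: (B, L, dim)
        B, L, _ = x.shape
        shift = torch.randint(0, self.max_len - L + 1, (1,)).item() if self.training else 0
        pos = torch.arange(shift, shift + L, device=x.device)
        return x + self.emb(pos)
```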
 Rotary positional embedding, unfortunately, doesn't help the model generalize to longer lengths.
 ALiBi positional embedding solves the extrapolation issue, but even after keeping it only for a subset of heads (as suggested in https://github.com/lucidrains/x-transformers) it still behaves worse than dynamic positional bias.
 xpos positional encoding
 Unfortunately, xpos shows rather poor performance on long sequences so we abstained from using this model in the final submission.
 We also tried shift augmentation and different sequence-padding approaches. This didn't improve model performance either.
 remove test data leakage
 13% of the public test sequences are identical to the ones present in the train dataset (by sequence)
 To avoid selecting a model that is memorizing more of these sequences, we zeroed out the predictions for these sequences
 sometimes we sent non-zeroed submissions in order to compare our performance to other participants (a zeroing sketch follows)
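 A pandas sketch of the zeroing step; file and column names follow the competition data (train_data.csv, test_sequences.csv, id_min/id_max, reactivity_*_MaP) but should be verified before use:

```python
import pandas as pd

# Zero out predictions for test sequences that also appear in train, so that
# model selection isn't driven by memorized sequences.
train = pd.read_csv("train_data.csv", usecols=["sequence"])
test = pd.read_csv("test_sequences.csv", usecols=["sequence", "id_min", "id_max"])
sub = pd.read_csv("submission.csv").set_index("id")

leaked = test[test["sequence"].isin(set(train["sequence"]))]
for _, row in leaked.iterrows():
    # .loc label slicing is inclusive on both ends, matching id_min..id_max
    sub.loc[row["id_min"]:row["id_max"],
            ["reactivity_DMS_MaP", "reactivity_2A3_MaP"]] = 0.0
sub.reset_index().to_csv("submission_zeroed.csv", index=False)
```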

(2nd) Squeezeformer layer + BPP Conv2D Attention
 https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/460316
 solution code: https://github.com/hoyso48/StanfordRibonanzaRNAFolding2ndplacesolution
 Squeezeformer was the most efficient (compared to newer Conv-Transformer hybrid architectures)
 it showed strong performance early in training and consistently converged faster.
 why use a GRU layer, especially after a transformer layer???
 it just yielded minor improvements. This was probably just intuition
 used ALiBi positional encoding since it's claimed to generalize better over long sequences than other methods (it worked better here too)
 add the BPP signal to the attention bias
 their model (diagram and code are in the write-up and repo linked above): notice that the output of the BPP 2D ConvNet is fed into the multi-head attention layers as learnable attention biases (a sketch follows)
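 A hedged sketch of this trick (the conv stack and layer sizes are made up for illustration): a small Conv2d stack maps the (L, L) BPP matrix to one bias map per attention head, which is added to the attention logits:

```python
import torch
import torch.nn as nn

class BPPBiasedAttention(nn.Module):
    """Self-attention whose logits receive a bias derived from the BPP matrix."""
    def __init__(self, dim, heads):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.out = nn.Linear(dim, dim)
        self.bpp_conv = nn.Sequential(          # (B, 1, L, L) -> (B, heads, L, L)
            nn.Conv2d(1, heads, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv2d(heads, heads, kernel_size=3, padding=1),
        )

    def forward(self, x, bpp):                  # x: (B, L, D), bpp: (B, L, L)
        B, L, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, L, self.heads, self.dk).transpose(1, 2) for t in (q, k, v))
        bias = self.bpp_conv(bpp.unsqueeze(1))  # per-head pairwise bias
        attn = (q @ k.transpose(-2, -1)) / self.dk**0.5 + bias
        out = attn.softmax(dim=-1) @ v          # (B, heads, L, dk)
        return self.out(out.transpose(1, 2).reshape(B, L, -1))
```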

(3rd) AlphaFold Style Twin Tower Architecture + Squeezeformer layer
 https://github.com/GosUxD/OpenChemFold
 recycling from AlphaFold2 wasn't useful
 this solution was VERY inspired by AlphaFold. That architecture is very complicated and I didn't look into how it works right now :'(
Takeaways
 adding a BPP-derived signal to the attention bias (generated by feeding the BPP matrix through a conv net) was used by all of the top 3 teams