Link: https://www.kaggle.com/c/novozymes-enzyme-stability-prediction

Problem Type: ordering objects in list

Input:

Output: the predicted thermal stability ranking of enzyme variants

  • officially, you want to output n tuples (seq_id, tm) - tm is the melting temperature of the enzyme
    • Higher tm means the protein variant is more stable.
  • Since only the Spearman Correlation Coefficient will be used for the evaluation, the correct prediction of the relative order is more important than the absolute tm values.
    • Basically, the tm value that you predict is only used to rank your predictions (before computing Spearman’s correlation coefficient against the true tm values)

Eval Metric: Spearman’s correlation coefficient
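
To make the "only the order matters" point concrete, here is a minimal standard-library sketch of Spearman’s coefficient (the Pearson correlation of the rank vectors). The numbers and the helper names (average_ranks, pearson, spearman) are made up for illustration and are not taken from the competition code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(pred, true):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(pred), average_ranks(true))


# Made-up melting temperatures and two predictions that share the same ordering.
tm_true = [48.2, 55.1, 61.7, 39.4, 70.3]
pred_a = [50.0, 52.0, 65.0, 41.0, 60.0]
pred_b = [p * 0.01 - 3 for p in pred_a]  # monotonic rescale of pred_a

print(spearman(pred_a, tm_true))
print(spearman(pred_b, tm_true))
```

Both calls print the same value, because rescaling the predictions does not change their ranks. In practice scipy.stats.spearmanr does the same job; the hand-rolled version is only here to keep the sketch dependency-free.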

Summary

  • Enzymes are proteins that act as catalysts in the chemical reactions of living organisms.

  • Novozymes finds enzymes in nature and optimizes them for use in industry.

  • why do they care about tm?

    • because a difference in the melting point (and pH) can affect the efficiency of the enzyme
      • not sure why they care about ranking specifically, though
  • glossary:

    • a “residue” is a single unit in a polymer chain: an amino acid in a protein (or a nucleotide in a nucleic acid)
  • Note: there were training data issues: https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/356251

Important notebooks/discussions

  • EDA and explaining what this competition is

    • https://www.kaggle.com/code/dschettler8845/novo-esp-eda-baseline
    • The frequency of some amino acids relative to the length of the AA sequence has a low, but non-zero, correlation with the target tm (melting point)
    • The fractional frequency appears slightly more informative
    • Some amino acid sequences are REALLY long (more than 32,000 residues), especially considering all of our test AA sequences are relatively short (~240)
    • While some amino acids are more common than others, all are relatively common (Sequences on average contain 1.5% to 10% of each amino acid respectively).
    • pH ranges from 0.0 to 11.0
    • there is a negative tm (target) value. Why?
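
The count and fractional-frequency features mentioned in the EDA notes can be sketched as follows. This is a minimal illustration and not the notebook’s code: the function name aa_features and the feature names (count_X, frac_X) are hypothetical, and the example sequence is made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def aa_features(sequence):
    """Per-sequence length, raw counts and fractional frequencies of each amino acid."""
    counts = Counter(sequence)
    length = len(sequence)
    features = {"length": length}
    for aa in AMINO_ACIDS:
        features[f"count_{aa}"] = counts[aa]
        # Fractional frequency: count normalised by sequence length.
        features[f"frac_{aa}"] = counts[aa] / length if length else 0.0
    return features


feats = aa_features("MKVLAAGIVGL")  # made-up 11-residue sequence
print(feats["length"], feats["count_A"], round(feats["frac_A"], 3))
```

Normalising by length matters here because sequence lengths vary so widely; raw counts would mostly just encode how long a sequence is.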
  • highest-scoring public kernel before the contest ended

    • https://www.kaggle.com/code/seyered/eda-novozymes-enzyme-stability
    • He basically used a bunch of techniques from many different papers
      • these techniques tended to look at the amino acid sequence to predict the tm
      • then he ensembled the tm predictions of the techniques using this
        • code for the global ensemble:
          testDf['tm'] = (
              4 * rank_nrom('rosetta') + 2*rank_nrom('rmsd') + 2*rank_nrom('thermonet') + 2*rank_nrom('plddtdiff') +\
              rank_nrom('sasaf') + rank_nrom('plddt') + rank_nrom('demask') + rank_nrom('ddG') + rank_nrom('blosum')
          ) / 14
      • he found the ensemble coefficients by increasing each weight by 1 until doing so lowered the public score
        • yep, they were definitely overfitting to the public LB.
    • interesting to note: people said that CV and the public LB were unstable
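
The rank-based ensemble above can be sketched without pandas. The body of the notebook’s rank_nrom helper is not shown in these notes, so the rank_norm function below is an assumption (rank divided by the number of rows); the toy scores are made up, and the sketch also assumes every column is oriented so that a higher score means a more stable variant.

```python
def rank_norm(scores):
    """Map scores to (0, 1] by rank; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return [r / len(scores) for r in ranks]


def weighted_rank_ensemble(columns, weights):
    """Weighted mean of rank-normalised columns (one value per variant)."""
    total = sum(weights.values())
    normed = {name: rank_norm(columns[name]) for name in weights}
    n = len(next(iter(normed.values())))
    return [
        sum(weights[name] * normed[name][i] for name in weights) / total
        for i in range(n)
    ]


# Toy per-variant scores on very different scales (made-up numbers).
columns = {
    "rosetta": [-1.2, 0.4, -3.0, 2.1],
    "blosum": [1, -2, -4, 3],
    "plddt": [88.0, 91.5, 70.2, 95.0],
}
weights = {"rosetta": 4, "blosum": 1, "plddt": 1}
tm_score = weighted_rank_ensemble(columns, weights)
print(tm_score)
```

Rank normalisation puts every method on the same scale before weighting, so a method with large raw values cannot dominate. The sketch divides by the sum of the weights where the quoted code divides by 14; any positive constant yields the same ordering, so the Spearman score is unaffected.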
  • Difference features

    • Shows that the distance between the mutation’s position (xyz) and the enzyme’s center (xyz) is highly correlated, presumably with the stability target
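
A minimal sketch of the mutation-to-center distance feature, assuming one (x, y, z) coordinate per residue (for example the C-alpha atom read from a PDB file) and taking the unweighted mean of those coordinates as the "center". Both choices are assumptions, and the coordinates below are made up.

```python
from math import dist
from statistics import mean


def centroid(coords):
    """Unweighted mean position of all residue coordinates."""
    xs, ys, zs = zip(*coords)
    return (mean(xs), mean(ys), mean(zs))


def distance_to_center(coords, position):
    """Euclidean distance from the residue at a 1-based position to the centroid."""
    return dist(coords[position - 1], centroid(coords))


# Made-up coordinates: four residues on a square plus one in the middle.
ca_coords = [
    (0.0, 0.0, 0.0),
    (4.0, 0.0, 0.0),
    (4.0, 4.0, 0.0),
    (0.0, 4.0, 0.0),
    (2.0, 2.0, 0.0),
]
print(distance_to_center(ca_coords, 5))  # residue sitting at the centroid
print(distance_to_center(ca_coords, 1))  # residue on a corner
```

A small distance means the mutated residue is buried near the core of the structure, while a large one means it is closer to the surface.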

Solutions

  • (1st) Protein as a Graph

    • https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/376371
    • “there was a big shakeup, so this isn’t the best solution”
    • Nodes represent residues, and if two residues are closer than a certain distance, the nodes are connected by edges.
    • The features of each node are the embedding vectors of wildtype sequence and mutant sequence generated by HuggingFace’s facebook/esm2_t33_650M_UR50D
    • graph centrality: a measure of the influence of a node in a network
    • they mention dTm, but nobody else mentioned this.
      • most likely ΔTm, i.e. the change in tm of a mutant relative to its wildtype in the train data
    • this was the code he used to create the graph from the pdb (3D coords of protein) files:
    • I don’t understand what they mean when they say that they are using a graph isomorphism network (GIN).
      • are they comparing the mutant enzyme to the original enzyme, using the original as a reference, so that the tm value is just an offset from the tm of the original enzyme?
      • I guess a trained GIN outputs an embedding that represents the graph. With that embedding, they can use it as a feature vector for downstream models
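
The graph construction described above (residues as nodes, edges between residues closer than a cutoff) can be sketched as below. This is not the first-place team’s code: the 8.0 cutoff, the use of one coordinate per residue, and the choice of degree centrality as the centrality measure are all assumptions made for illustration, and the coordinates are made up.

```python
from itertools import combinations
from math import dist


def contact_graph(coords, cutoff=8.0):
    """Adjacency sets: residues i and j are linked if they are closer than cutoff."""
    adjacency = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) < cutoff:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def degree_centrality(adjacency):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adjacency)
    if n < 2:
        return {node: 0.0 for node in adjacency}
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


# Made-up residue coordinates along a line; the last residue is far from the rest.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
graph = contact_graph(coords)
centrality = degree_centrality(graph)
print(graph)
print(centrality)
```

In the solution as described, each node would additionally carry the ESM-2 embedding of the wildtype and mutant residue as its feature vector; that part is omitted here because it needs the model weights.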

Takeaways

  • TODO