Novozymes Enzyme Stability Prediction

Link: https://www.kaggle.com/c/novozymes-enzyme-stability-prediction

Problem Type: ranking (ordering objects in a list)

Input:

You also get an AlphaFold2 structure prediction of the wildtype enzyme that the variants are derived from
- these variants are made via “point amino acid mutation and deletion”
  - This is why they are called mutations
- you can see how the mutations were made in this notebook: https://www.kaggle.com/code/gehallak/nesp-3d-geometry-0-32-lb
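
To make "point mutation and deletion" concrete, here is a minimal sketch of how a single edit could be located by comparing a variant against the wildtype sequence. The function name `find_mutation` and the toy sequences are made up for illustration and are not taken from the linked notebook; it assumes each variant differs from the wildtype by exactly one substitution or one deletion.

```python
def find_mutation(wildtype: str, variant: str):
    """Locate a single substitution or deletion (0-based position).

    Returns (kind, position, wildtype_residue, variant_residue).
    Assumes at most one edit separates the two sequences.
    """
    if len(variant) == len(wildtype):
        # Same length: look for the one position where the residues differ.
        diffs = [i for i, (a, b) in enumerate(zip(wildtype, variant)) if a != b]
        if not diffs:
            return ("identical", None, None, None)
        if len(diffs) > 1:
            raise ValueError("more than one substitution found")
        i = diffs[0]
        return ("substitution", i, wildtype[i], variant[i])

    if len(variant) == len(wildtype) - 1:
        # One residue shorter: the first mismatch marks the deleted residue.
        # Inside a run of identical residues the exact position is ambiguous;
        # this reports the last residue of the run.
        i = next(
            (k for k in range(len(variant)) if wildtype[k] != variant[k]),
            len(variant),
        )
        if wildtype[:i] + wildtype[i + 1:] != variant:
            raise ValueError("sequences differ by more than one deletion")
        return ("deletion", i, wildtype[i], None)

    raise ValueError("unsupported length difference")


substitution = find_mutation("MKVLA", "MKILA")
deletion = find_mutation("MKVLA", "MKLA")
```

The returned position is what later notes use to look up the mutated residue’s 3D coordinates in the wildtype structure.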

Output: the predicted thermal stability ranking of the enzyme variants

Officially, you output n tuples (seq_id, tm), where tm is the melting temperature of the enzyme.
- A higher tm means the protein variant is more stable.
Since only Spearman’s correlation coefficient is used for evaluation, predicting the correct relative order matters more than the absolute tm values.
- Basically, the tm value that you predict is only used to order your predictions (before computing Spearman’s correlation coefficient against the true tm)

Eval Metric: Spearman’s correlation coefficient
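
A small sketch of why only the order of the predictions matters. Spearman’s coefficient is the Pearson correlation of the ranks, so any strictly increasing transform of the predictions leaves it unchanged. The tm values and scores below are made up, and the helper names `average_ranks` and `spearman` are illustrative, not the competition’s scoring code.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def spearman(y_true, y_pred):
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(y_true), average_ranks(y_pred))[0, 1])


tm_true = np.array([48.2, 55.0, 61.3, 39.9, 70.1])  # made-up melting temperatures
scores = np.array([0.1, 0.4, 0.3, 0.0, 0.9])        # made-up model outputs

rho_raw = spearman(tm_true, scores)
rho_rescaled = spearman(tm_true, 100.0 * scores + 7.0)  # linear rescaling
rho_exp = spearman(tm_true, np.exp(scores))             # monotonic, non-linear
```

All three values come out the same (about 0.9 for this toy data), so the predicted "tm" does not need to be on a temperature scale at all.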

Summary

Enzymes are proteins that act as catalysts in the chemical reactions of living organisms.
Novozymes finds enzymes in nature and optimizes them for use in industry.
Why do they care about tm?
- because a difference in the melting point (and pH) can affect the efficiency of the enzyme
  - not sure why they care about ranking though
glossary:
- a “residue” is a single monomer unit in a polymer chain: an amino acid in a protein, or a nucleotide in a nucleic acid
Note: there were training data issues: https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/356251

Important notebooks/discussions

EDA and explaining what this competition is
- https://www.kaggle.com/code/dschettler8845/novo-esp-eda-baseline
- The frequency of some amino acids relative to the length of the AA sequence has a low but non-zero correlation with the target tm (melting point)
- The fractional frequency appears slightly more informative
- Some amino acid sequences are REALLY long, especially considering that all of our test AA sequences are relatively short: ~240 vs more than 32,000
- While some amino acids are more common than others, all are relatively common (on average, each amino acid makes up 1.5% to 10% of a sequence).
- pH ranges from 0.0 to 11.0
- there is a negative tm (target) value. why?
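
The fractional-frequency feature from the EDA notes above can be sketched as follows. This is an illustration under assumptions, not the notebook’s code: the name `aa_fractions` and the example sequence are made up, and only the 20 standard amino acid letters are counted.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_fractions(sequence: str) -> dict:
    """Fraction of the sequence made up by each standard amino acid."""
    if not sequence:
        raise ValueError("empty sequence")
    counts = Counter(sequence)
    # Dividing by the length makes short and very long sequences comparable.
    return {aa: counts.get(aa, 0) / len(sequence) for aa in AMINO_ACIDS}


features = aa_fractions("AACD")
```

Normalizing by length matters here because the training sequences vary so much in length compared with the test sequences.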
Highest-scoring public kernel before the contest ended
- https://www.kaggle.com/code/seyered/eda-novozymes-enzyme-stability
- He basically combined a bunch of techniques from many different papers
  - these techniques tended to look at the amino acid sequence to predict the tm
  - then he ensembled the tm predictions of the techniques using this
    - code
      Global ensemble:
      testDf['tm'] = (
          4 * rank_nrom('rosetta') + 2 * rank_nrom('rmsd') + 2 * rank_nrom('thermonet') + 2 * rank_nrom('plddtdiff')
          + rank_nrom('sasaf') + rank_nrom('plddt') + rank_nrom('demask') + rank_nrom('ddG') + rank_nrom('blosum')
      ) / 14
  - he found the ensemble coefficients by increasing a weight by 1 at a time until the public score got worse
    - yep they were definitely overfitting.
- interesting to note: ppl said that CV and the public LB were unstable
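
A minimal sketch of the rank-normalized weighted ensemble described above. The helper `rank_norm`, the function `weighted_rank_ensemble`, and the toy scores are assumptions for illustration (the kernel’s own helper is spelled `rank_nrom` and its body is not shown in these notes). The sketch assumes every feature is oriented so that a higher value means "more stable", and it breaks ties by position. Note that the quoted line divides by 14 although the listed weights sum to 15; because only the ordering matters for Spearman, that constant has no effect on the score.

```python
import numpy as np


def rank_norm(scores):
    """Map raw scores to (0, 1] by rank so features share a common scale."""
    scores = np.asarray(scores, dtype=float)
    # Double argsort gives 0-based ranks; ties are broken by position.
    ranks = np.argsort(np.argsort(scores, kind="stable"), kind="stable") + 1
    return ranks / len(scores)


def weighted_rank_ensemble(features, weights):
    """Weighted average of rank-normalized feature columns."""
    total = sum(weights.values())
    blended = sum(w * rank_norm(features[name]) for name, w in weights.items())
    return blended / total


# Made-up scores for four variants from two hypothetical predictors.
toy_features = {
    "rosetta": [-1.2, 0.3, -0.5, 2.0],
    "rmsd": [0.1, 0.4, 0.2, 0.3],
}
toy_weights = {"rosetta": 4, "rmsd": 2}

blend = weighted_rank_ensemble(toy_features, toy_weights)
predicted_order = np.argsort(blend)  # least stable first
```

Tuning integer weights against the public leaderboard, as the notes describe, amounts to fitting `toy_weights` on the public test split, which is why overfitting was a concern.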
Difference features
- https://www.kaggle.com/code/cdeotte/difference-features-lb-0-600
  - PDB files are protein data bank files. they describe the three-dimensional structures of molecules
  - he took a dataset of PDB files to make features
Showing that the distance between the mutation’s position (xyz) and the enzyme’s center (xyz) correlates with tm
- https://www.kaggle.com/code/gehallak/nesp-3d-geometry-0-32-lb
  - using a single feature, the distance between the mutation’s coordinates and the enzyme’s center coordinates, as the predicted tm, he got a public score of 0.32
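
The single geometric feature can be sketched like this. It is an illustration under assumptions rather than the notebook’s code: the function name `distance_to_center` is made up, the coordinates are toy values, and the "center" is taken to be the mean of the per-residue coordinates (one xyz point per residue, for example the alpha carbon).

```python
import numpy as np


def distance_to_center(residue_coords, mutation_index):
    """Euclidean distance from the mutated residue to the protein's centroid."""
    residue_coords = np.asarray(residue_coords, dtype=float)
    center = residue_coords.mean(axis=0)  # centroid of all residues
    return float(np.linalg.norm(residue_coords[mutation_index] - center))


# Toy structure: four residues in a square plus one sticking out along z.
toy_coords = [
    [0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [2.0, 2.0, 0.0],
    [1.0, 1.0, 3.0],
]

# One value per candidate mutation site; used directly as the predicted "tm".
distances = [distance_to_center(toy_coords, i) for i in range(len(toy_coords))]
```

Because the metric is rank-based, the raw distance can be submitted as-is without converting it to a temperature.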

Solutions

(1st) Protein as a Graph
- https://www.kaggle.com/competitions/novozymes-enzyme-stability-prediction/discussion/376371
- “there was a big shakeup, so this isn’t the best solution”
- Nodes represent residues, and if two residues are closer than a certain distance, the nodes are connected by edges.
- The features of each node are the embedding vectors of wildtype sequence and mutant sequence generated by HuggingFace’s facebook/esm2_t33_650M_UR50D
- graph centrality: a measure of how influential a node is in a network
- they mention dTm, but nobody else mentioned this.
  - I assumed it’s the difference between tm and (mutation coord - enzyme center coord) in the train data, though dTm more commonly denotes the change in tm of a mutant relative to its wildtype
- this was the code he used to create the graph from the pdb (3D coords of protein) files:
  - https://www.kaggle.com/code/gyozzza/create-graph-data-from-pdb-files-for-gnn
  - how it works:
    - each node in the graph is an amino acid in the pdb file (which tells us its xyz)
    - for each protein, create the distance matrix, which holds the Euclidean distance between every pair of amino acids in the protein
      - build the adjacency matrix by connecting amino acids whose distance is < a threshold:
      - adj = distance_matrix < distance_threshold
      - u, v = np.nonzero(adj)
  - uses https://www.dgl.ai/
- I don’t understand what they mean when they say that they are using a graph isomorphism network (GIN).
  - are they comparing the mutated enzyme to the original enzyme, using the original as a reference, so that the predicted tm is just an offset from the tm of the original enzyme?
  - I guess training a GIN outputs an embedding that represents the graph. Now that they have an embedding, they can use it as a feature vector for downstream models
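
A rough sketch of the two ideas above, written in plain NumPy rather than DGL. It is not the winning solution’s code. The graph-building part follows the `adj` / `np.nonzero` lines quoted in the notes; the 8.0 threshold, the function names, the toy coordinates, and the weight matrices are assumptions. On the GIN question: a graph isomorphism network is a type of message-passing layer, and the "isomorphism" in the name refers to its expressive power being tied to the Weisfeiler-Lehman graph isomorphism test, not to comparing the mutant graph against the wildtype graph. Each layer updates a node as MLP((1 + eps) * h_v + sum of neighbor features), and summing the node vectors gives one embedding for the whole graph.

```python
import numpy as np


def build_edges(coords, distance_threshold=8.0):
    """Connect residues whose pairwise Euclidean distance is below a threshold."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    distance_matrix = np.linalg.norm(diff, axis=-1)
    adj = distance_matrix < distance_threshold
    # As quoted in the notes, the comparison also links each node to itself
    # (distance 0). Self-loops are dropped here because the GIN update below
    # already includes the node's own features via the (1 + eps) term.
    np.fill_diagonal(adj, False)
    u, v = np.nonzero(adj)  # edge list: u[k] -> v[k]
    return adj, u, v


def gin_layer(h, adj, w1, w2, eps=0.0):
    """One GIN update with a two-layer MLP (ReLU in between, no biases)."""
    neighbor_sum = adj.astype(float) @ h     # sum of neighbor features
    z = (1.0 + eps) * h + neighbor_sum       # combine self and neighbors
    return np.maximum(z @ w1, 0.0) @ w2      # MLP


def graph_embedding(h):
    """Sum readout: one vector per graph."""
    return h.sum(axis=0)


# Toy protein: residues 0 and 1 are close, residue 2 is far away.
toy_coords = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
adj, u, v = build_edges(toy_coords)

node_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # stand-in for ESM embeddings
identity = np.eye(2)                                             # stand-in for learned weights
updated = gin_layer(node_features, adj, identity, identity)
embedding = graph_embedding(updated)
```

In the real setup the node features would be the ESM-2 embeddings mentioned above and the weights would be learned; the resulting graph embedding is what a regression head (or a downstream model) would consume.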

Takeaways

TODO
