Link: https://www.kaggle.com/c/cafa-5-protein-function-prediction

Problem Type: Multi-label, hierarchical classification

Input:

    1. The FASTA files for proteins (i.e. the amino acid sequences that make them up) AND the corresponding GO terms each protein is annotated with
    • For each GO term, we also learn its aspect:
      1. Molecular Function (MF)
      2. Biological Process (BP)
      3. Cellular Component (CC)
    2. the gene ontology graph (.obo file):
    • basically, each aspect above is the root of a subgraph whose GO terms get more specific as you go down
      • e.g. BP is the root, and it points to GO terms that are more specific
      • then each of those has children that describe even more specific functions, and so on
      • some GOs are leaves.
      • biologically, if a protein has a GO term, then it also has all of that term’s parent GO terms up to the root (MF, BP, or CC); see the sketch after this list
    • a GO can have multiple parents, so it’s a DAG, not a tree
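
A minimal sketch of that propagation rule (the “true-path rule”), using a hypothetical toy DAG in place of the real .obo graph:

    # toy stand-in for the ontology: maps each GO term to its parents
    # (GO:D has two parents, which is why this is a DAG, not a tree)
    parents = {
        "GO:BP": [],                # aspect root
        "GO:A": ["GO:BP"],
        "GO:B": ["GO:BP"],
        "GO:D": ["GO:A", "GO:B"],   # multiple parents
    }

    def propagate(terms):
        """Expand a set of GO terms with all of their ancestors."""
        out = set()
        stack = list(terms)
        while stack:
            t = stack.pop()
            if t not in out:
                out.add(t)
                stack.extend(parents[t])
        return out

    print(propagate({"GO:D"}))  # {'GO:D', 'GO:A', 'GO:B', 'GO:BP'}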

Output: for each protein, return a set of (GO term ID, probability) pairs.

  • A GO term describes a protein function. By returning the set of GO term IDs for the protein, you are saying: “this protein performs this set of functions”
    • note: the probability is the model saying: “This is the probability that this protein has this function”
  • Note: GO means gene ontology
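
A minimal sketch of writing that output in the competition’s TSV submission format (the predictions dict here is hypothetical example data):

    # each protein maps to a list of (GO term ID, probability) pairs
    predictions = {
        "P9WHI7": [("GO:0003674", 0.92), ("GO:0005488", 0.71)],
    }

    with open("submission.tsv", "w") as f:
        for protein_id, terms in predictions.items():
            for go_id, prob in terms:
                f.write(f"{protein_id}\t{go_id}\t{prob:.3f}\n")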

Eval Metric: weighted F-score (Fmax)

Terms deep in the ontology tend to appear less frequently and be harder to predict, so their weights are larger (Clark & Radivojac, 2013). This does not always hold true, however, as highlighted in the following discussion.

  • TODO: understand why their metric implementation was flawed. (there was a discussion post). This will teach you how this metric actually works
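
For reference, a minimal sketch of a CAFA-style weighted Fmax. This is my reconstruction, not the official implementation; the information-accretion weights ia and the toy data are hypothetical:

    import numpy as np

    ia = {"GO:A": 0.5, "GO:B": 2.3}               # per-term information-accretion weights
    truth = {"prot1": {"GO:A", "GO:B"}}           # propagated ground-truth terms
    preds = {"prot1": {"GO:A": 0.9, "GO:B": 0.4}}

    def fmax(preds, truth, ia, thresholds=np.linspace(0.01, 0.99, 99)):
        best = 0.0
        for tau in thresholds:
            precs, recs = [], []
            for prot, true_terms in truth.items():
                pred_terms = {t for t, p in preds.get(prot, {}).items() if p >= tau}
                tp = sum(ia[t] for t in pred_terms & true_terms)
                recs.append(tp / sum(ia[t] for t in true_terms))
                if pred_terms:  # precision only averages proteins with >= 1 prediction
                    precs.append(tp / sum(ia[t] for t in pred_terms))
            if precs:
                pr, rc = np.mean(precs), np.mean(recs)
                if pr + rc > 0:
                    best = max(best, 2 * pr * rc / (pr + rc))
        return best

    print(round(fmax(preds, truth, ia), 3))  # 1.0 for this toy example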

Summary

The goal is to figure out “what each protein does” (its function)

Note: This is like an old-school Kaggle competition. You submit predictions for the test set directly via a TSV (tab-separated values) file.

  • so competitors know what the test targets (the proteins to predict) are

A good solution uses both the FASTA file (getting a protein language model to make predictions based on the sequence) and the gene ontology graph

Important notebooks

Solutions

  • Note: the classes are hierarchical: a child class can ONLY be positive if all of its parent classes are also positive.

  • (2nd) graph neural net + conditional probabilities + gradient-boosted trees. just hardcore stuff

    • https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/discussion/434064
    • solution code: https://github.com/btbpanda/CAFA5-protein-function-prediction-2nd-place
      • they only did a simple 5-fold CV (nothing else was better)
      • Gradient-boosted decision trees improved their models the most (their own Pyboost framework is super fast)
      • Alternative modelling approach - predicting the conditional probabilities
        • this is very smart. Basically, when training, for each protein, they tell the model which GO terms it has.
          • but for GO terms that have no positive parent for that protein, the target value is replaced with NaN, so the model ONLY LEARNS VALID GO terms for that protein
            • the three target values are 0, 1, and NaN.
            • 0 still helps, since a parent is technically a superclass of the GO term, so if the model can’t be precise enough to predict the lowest child node, predicting a parent node is still good
        • calculating the conditional probability is also very smart:
          • p_term_raw = p_term_cond * (1 - (1 - p_parent_0_raw) * (1 - p_parent_1_raw) * ... * (1 - p_parent_N_raw))
            • and processing the parent nodes first (i.e. in topological order), so the child nodes can use the parent probabilities, just makes sense. very cool. see the sketch after the code block below
        • tbh I wasn’t sure what this meant at first; the code below clears it up
          • as far as I can tell: the target is 1 if the protein is annotated with the term (after propagation), 0 if at least one of the term’s parents is positive but the term itself isn’t, and NaN if no parent is positive (so the loss just skips it)
          • that’s exactly what makes the learned output a conditional probability: P(term | at least one parent is present)
          • This is the entire target-construction code (comments mine):

            trg = np.zeros((num.shape[0], ont.idxs), dtype=np.float32)
            # mark 1 for every (protein, GO term) annotation pair
            np.add.at(trg, (trm_ont['n'].values, trm_ont['id'].values), 1)

            if args.propagate:
                propagate_target(trg, ont)  # also set all ancestors of positive terms to 1

            # create NaNs from graph: mask terms whose parents are all zero
            for k, node in enumerate(ont.terms_list):
                adj = node['adj']  # appears to hold the indices of term k's parents
                if len(adj) > 0:
                    # rows (proteins) where no parent of term k is positive
                    na = np.nonzero(np.nansum(trg[:, adj], axis=1) == 0)[0]
                    # sanity check: propagation guarantees the term itself is 0 there
                    assert np.nansum(trg[na, k]) == 0, 'Should be empty'
                    trg[na, k] = np.nan  # these entries are excluded from the loss

            • I’m not sure what trm_ont['n'].values means (probably the protein’s row index, since it indexes the rows of trg). This code is hard to fully understand without running it.
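            • to make this concrete, here’s a minimal sketch (my reconstruction, not their code) of the conditional-to-raw conversion from the formula above, on a hypothetical toy DAG where parents come before children:

              import numpy as np

              # hypothetical toy DAG: term indices 0..3, numbered in topological order
              parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
              p_cond = np.array([0.9, 0.8, 0.5, 0.7])  # model outputs: P(term | some parent)

              p_raw = np.zeros_like(p_cond)
              for term in sorted(parents):  # parents are processed before their children
                  pars = parents[term]
                  if not pars:
                      p_raw[term] = p_cond[term]  # roots: conditional == raw
                  else:
                      # P(at least one parent) = 1 - prod(1 - P(parent_i))
                      p_some_parent = 1.0 - np.prod(1.0 - p_raw[pars])
                      p_raw[term] = p_cond[term] * p_some_parent

              print(p_raw)  # child terms are damped by their parents' raw probabilities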
      • used a Graph Convolutional Network (GCN) for the node classification task
      • postprocessing
        • clip outputs: predictions below a low cutoff are dropped
          • cutoff_threshold_low = 0.1 # prediction < cutoff_threshold_low will be set to zero (i.e. no need to save to submission file)
        • by looking at the ontology graph, for each predicted GO term they make sure that the assigned probability makes sense
        • so they look up all the parents of the GO term.
          • they make sure that the predicted probability is never higher than the probability of any parent (sketch after this list)
            • perhaps they get this probability from the training set if it exists
          • doing this check is important because their models don’t take this into account (some aren’t aware of the graph relationships)
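        • a minimal sketch of that consistency fix (my reconstruction, on a hypothetical toy DAG): cap each term at the minimum of its parents’ probabilities, then drop tiny predictions:

          import numpy as np

          # hypothetical toy DAG in topological order (parents before children)
          parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}
          probs = np.array([0.6, 0.8, 0.3, 0.9])  # raw, possibly inconsistent outputs

          cutoff_threshold_low = 0.1  # predictions below this are not saved

          for term in sorted(parents):
              for p in parents[term]:
                  # a child can never be more probable than any of its parents
                  probs[term] = min(probs[term], probs[p])

          probs[probs < cutoff_threshold_low] = 0.0
          print(probs)  # [0.6 0.6 0.3 0.3]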
  • (3rd) Used computationally hypothesized GO labels as features

    • https://www.kaggle.com/competitions/cafa-5-protein-function-prediction/discussion/464437
    • in addition to the ground-truth GO labels, there are other GO labels that were computationally hypothesized (non-experimental)
    • Since the test labels will only be experimentally determined in the future, he treated the train/validation/private-test split like it was time-series data
      • This feels weird, but I guess it doesn’t matter: as long as the validation data is different from your train data, it’s good. It’s still odd, though, since you might not be validating your model properly (the metric may not be stable, since the validation distribution is shifted from your train data)
        • but I guess he had to do it this way, because he knows that in the future the test (private) data might be collected with more precision, so the drift is expected. A sketch of such a split is below.
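      • a minimal sketch of such a time-aware split, assuming each annotation carries a date (the column names and data here are hypothetical):

        import pandas as pd

        ann = pd.DataFrame({
            "protein": ["P1", "P2", "P3", "P4"],
            "go_term": ["GO:A", "GO:B", "GO:A", "GO:C"],
            "date": pd.to_datetime(["2019-01-01", "2020-06-01", "2022-03-01", "2023-01-01"]),
        })

        cutoff = pd.Timestamp("2021-01-01")
        train = ann[ann["date"] < cutoff]   # older annotations
        valid = ann[ann["date"] >= cutoff]  # newer ones stand in for the future test set
        print(len(train), len(valid))       # 2 2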
  • (4th) extracted features using NLP on academic papers

Takeaways

  • Using language models (e.g. T5 or ProtBERT) to extract embeddings from the protein sequence is highly recommended
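
A minimal sketch of that embedding extraction, assuming the HuggingFace transformers library and the Rostlab/prot_bert checkpoint (ProtBERT expects space-separated amino acids):

    import re
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
    model = BertModel.from_pretrained("Rostlab/prot_bert").eval()

    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # hypothetical protein sequence
    spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))  # map rare residues to X

    with torch.no_grad():
        inputs = tokenizer(spaced, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state  # (1, seq_len + 2, 1024)

    # mean-pool residue tokens (skip [CLS]/[SEP]) to get one vector per protein
    embedding = hidden[0, 1:-1].mean(dim=0)
    print(embedding.shape)  # torch.Size([1024])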