Link: https://www.kaggle.com/competitions/open-problems-single-cell-perturbations

Problem Type: multi-target regression

Input: cell types and small molecule names

Output: An 18,211-dim vector for each row: the predicted differential-expression value for each of the 18,211 genes when the molecule (sm_name) is applied to this cell_type.

Eval Metric: Mean Rowwise Root Mean Squared Error (MRRMSE)

Summary

This is basically a regression problem with 2 feature columns and 18,211 targets.
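For reference, a minimal NumPy sketch of the metric (my own helper, not from the competition code):

```python
import numpy as np

def mrrmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Rowwise RMSE: compute the RMSE of each row
    (one cell_type/sm_name pair across all 18,211 genes), then average over rows."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2, axis=1)).mean())
```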

Note: an estimated 35% of the training data is erroneous: https://www.kaggle.com/code/jalilnourisa/post-eda

  • This could affect the validity of techniques used in this competition

Solutions

  • (1st) ChemBERTa

    • https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/discussion/459258
    • generated features by embedding a text description of the cell
      • but these text embeddings decreased his score
      • he also tried fine-tuning these text embeddings, but that didn’t work either
    • cross validation: 5-fold CV
    • ChemBERTa embeddings of the SMILES encodings helped tremendously
    • other features he used: the mean, standard deviation, and (25%, 50%, 75%) percentiles per cell type and small molecule
      • he is aggregating the target values themselves: the statistics are computed over the 18,211 target columns, grouped by cell_type and (separately) by sm_name; see the sketch after the feature-representation list
    • feature representations
      • “initial”: ChemBERTa embeddings, 1 hot encoding of cell_type/sm_name pairs, mean, std, percentiles of targets per cell_type and sm_name
      • “light”: ChemBERTa embeddings, 1 hot encoding of cell_type/sm_name pairs, mean targets per cell_type and sm_name
      • “heavy”: ChemBERTa embeddings, 1 hot encoding of cell_type/sm_name pairs, mean, 25%, 50%, 75% percentiles of targets per cell_type and sm_name
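      • a rough sketch of how the “light” set might be assembled (`de_train` and `chemberta_emb` are my assumed names for the training frame and the precomputed SMILES embeddings):

        ```python
        import numpy as np
        import pandas as pd

        def build_light_features(de_train: pd.DataFrame, chemberta_emb: np.ndarray) -> np.ndarray:
            """'light' features: ChemBERTa embeddings + one-hot encodings
            + per-group means of the 18,211 target columns."""
            gene_cols = [c for c in de_train.columns if c not in ("cell_type", "sm_name")]
            # the aggregation is over the target values, grouped by cell_type / sm_name
            mean_by_cell = de_train.groupby("cell_type")[gene_cols].mean()
            mean_by_sm = de_train.groupby("sm_name")[gene_cols].mean()
            one_hot = pd.get_dummies(de_train[["cell_type", "sm_name"]])  # one-hot of both columns
            return np.hstack([
                chemberta_emb,                                    # one embedding row per train row
                one_hot.to_numpy(dtype=np.float32),
                mean_by_cell.loc[de_train["cell_type"]].to_numpy(),
                mean_by_sm.loc[de_train["sm_name"]].to_numpy(),
            ])
        ```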
    • model selection:
      • didn’t work: gradient boosting models, MLP, and 2D CNN
      • worked: LSTM, GRU, 1D CNN
    • Loss function:
      • 0.32·MSELoss + 0.24·MAELoss + 0.24·LogCosh + 0.2·BCELoss
        • Although BCE is meant for binary classification, it was used because it “sends better signals to the models and optimizers when the target values are close to zero”
        • e.g.
          • BCELoss(0.05, -0.05) = 0.694
          • MSELoss(0.05, -0.05) = 0.010 # much smaller, even though the prediction misses the target’s sign!
        • Since most target values follow a roughly Gaussian distribution centered at 0, BCELoss punishes imprecise predictions near zero more aggressively
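        • the 0.694 above is reproducible in PyTorch if we assume BCE-with-logits against a sigmoid-squashed target (my reading; the write-up doesn’t spell out the exact transform):

          ```python
          import torch
          import torch.nn.functional as F

          pred = torch.tensor(0.05)
          target = torch.tensor(-0.05)

          # squash the raw target into (0, 1) so BCE is well-defined
          bce = F.binary_cross_entropy_with_logits(pred, torch.sigmoid(target))
          mse = F.mse_loss(pred, target)
          print(round(bce.item(), 3))  # 0.694 -> strong signal near zero
          print(round(mse.item(), 3))  # 0.010 -> weak signal despite the sign error
          ```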
    • removing padding from the ChemBERTa inputs improved private leaderboard results
    • also, setting 30% of the input features’ entries to 0 improved the score
      • the hypothesis is either:
          1. we might not need to know the complete chemical structure of a molecule to know its impact on a cell, OR
          2. there is a biological disorder in the cell, but we still expect it to respond to the drug in the same way
      • this feels like dropout
      • NOTE: significant training-data quality issues (see the ~35% erroneous-data estimate above) could mean this technique isn’t valid
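      • a minimal sketch of the 30% input-zeroing, assuming it’s applied per batch during training like unscaled dropout:

        ```python
        import torch

        def zero_input_features(x: torch.Tensor, p: float = 0.3) -> torch.Tensor:
            """Randomly zero a fraction p of the input-feature entries (training only)."""
            mask = (torch.rand_like(x) >= p).float()  # keep each entry with prob 1 - p
            return x * mask
        ```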
  • (2nd) target encoding

  • (3rd) 2-stage prediction. Stage 1: create pseudolabels; stage 2: final prediction

    • https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/discussion/458750

    • the sm_name column maps one-to-one with the SMILES column. So he dropped the sm_name column

      • he tried using a neural network on the SMILES column but failed.
    • for each gene, he plotted the range of observed values and found that the per-gene range can be anywhere from 4 to 50.

      • 50 is a big value that can dominate MSELoss or MAELoss, so he standardized the columns (divided each by its std) and computed a standardized MSE.
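      • a sketch of that standardized MSE, assuming per-gene stds from the training targets:

        ```python
        import numpy as np

        def standardized_mse(y_true: np.ndarray, y_pred: np.ndarray, col_std: np.ndarray) -> float:
            """MSE after dividing each gene column by its training-set std,
            so wide-range genes don't dominate the loss."""
            diff = (y_true - y_pred) / col_std  # col_std: shape (18211,)
            return float(np.mean(diff ** 2))
        ```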
    • he made sure that “Every fold contains one cell type chosen from NK cells, T cells CD4+, T cells CD8+, T regulatory cells”

      • only sm_names that appear in the public and private test sets were included in the validation folds
        • so validation isn’t scored on irrelevant names
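      • one plausible reading of that fold construction, as code (`test_sm_names` is my assumed name for the set of molecules in the public/private test):

        ```python
        import numpy as np
        import pandas as pd

        HELDOUT_CELL_TYPES = ["NK cells", "T cells CD4+", "T cells CD8+", "T regulatory cells"]

        def make_folds(de_train: pd.DataFrame, test_sm_names: set):
            """4 folds: each validates on one held-out cell type,
            restricted to sm_names that appear in the test set."""
            for cell_type in HELDOUT_CELL_TYPES:
                val_mask = (de_train["cell_type"] == cell_type) & de_train["sm_name"].isin(test_sm_names)
                yield np.where(~val_mask)[0], np.where(val_mask)[0]
        ```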
    • has a 2-stage prediction:

    • 1st stage - pseudolabel all the test data (255 rows) to get more training data (this is why he’s third!)

      • used optuna for all hyperparams:
        • dropout %, number of neurons per layer, output dim of the embedding layer, number of epochs, learning rate, batch size, and the number of dimensions for truncated singular value decomposition
      • he used 4-fold CV, but ran each fold twice (probably with a different seed / data shuffle)
      • the final prediction of this first stage is an ensemble of 7 models
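      • a skeleton of that Optuna search (the ranges and the `run_cv` helper are illustrative, not from the write-up):

        ```python
        import optuna

        def objective(trial: optuna.Trial) -> float:
            params = {
                "dropout": trial.suggest_float("dropout", 0.0, 0.5),
                "hidden_dim": trial.suggest_int("hidden_dim", 64, 1024),
                "embed_dim": trial.suggest_int("embed_dim", 8, 128),
                "epochs": trial.suggest_int("epochs", 10, 200),
                "lr": trial.suggest_float("lr", 1e-4, 1e-2, log=True),
                "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
                "svd_dim": trial.suggest_int("svd_dim", 16, 256),  # truncated SVD of targets
            }
            return run_cv(params)  # hypothetical: mean MRRMSE over the folds

        study = optuna.create_study(direction="minimize")
        study.optimize(objective, n_trials=100)
        ```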
    • 2nd stage - use train data + pseudolabelled test data

      • he used 20 models with diff hyperparams (didn’t mention seeds!)
      • more optuna
      • models had high variance, so every model was trained 10 times on the full dataset and the median of the predictions was taken as the final prediction
        • he used the median rather than the mean!
      • he made sure to clip the final predictions to each column’s min/max (computed only from the original training data)
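      • a sketch of that clipping step:

        ```python
        import numpy as np

        def clip_to_train_range(preds: np.ndarray, y_train: np.ndarray) -> np.ndarray:
            """Clip each gene column to the [min, max] seen in the original training
            data (pseudolabeled rows excluded from the range computation)."""
            return np.clip(preds, y_train.min(axis=0), y_train.max(axis=0))
        ```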
    • History of improvements:

      1. replacing one-hot encoding with an embedding layer
      2. replacing MAE loss with MRRMSE loss
      3. ensembling models with the mean
      4. dimensionality reduction with truncated singular value decomposition (SVD)
      5. ensembling models with a weighted mean
      6. using pseudolabeling
        1. then pseudolabeling plus an ensemble of 20 models with a weighted mean

      What did not work for him:

      • label normalization / standardization
      • chained regression
      • denoising the dataset
      • outlier removal
      • adding noise to labels
          • though adding a little noise (0.01 × std) can even improve the model’s performance
      • training on selected easy / hard-to-predict columns
      • Huber loss
    • Huber loss didn’t work!!! Neither did outlier removal!

  • (4th) good CV + feature engineering

    • https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/discussion/460191
    • cross validation:
    • feature engineering for their RAPIDS SVR model
        1. one-hot encoding features of cell_type and sm_name
        2. embedding features from ChemBERTa-10M-MTR on SMILES strings
        • During the competition, we noticed in the discussion forums that many mentioned the embedding features generated by ChemBERTa-10M-MTR did not yield positive results. In our trials, we found that these features had a negative impact on models like LGBM and CatBoost, but they improved performance in RAPIDS SVR. Therefore, we decided to use RAPIDS SVR to generate pseudo-labels as a basis for subsequent model training.
        • When screening features, they trained and validated the model on only the first target, A1BG.
          • This tightened their iteration loop to only a few minutes
        3. target encoding for cell_type and sm_name
        • used the [‘mean’, ‘min’, ‘max’, ‘median’, ‘first’, ‘quantile_0.4’] aggregations
    • feature engineering for their PyBoost model (derived from Alexander Chervov’s work)
        1. pseudo-label features from RAPIDS SVR
        2. leave-one-out encoding features for cell_type and sm_name
        • this was better than one-hot encoding
        3. embedding features generated by ChemBERTa-10M-MTR for SMILES
        4. target encoding with [‘mean’, ‘max’]
        5. reduced the dimensionality of the 18,211 targets to 45 using TruncatedSVD
        • similarly, they also reduced the above features to the same 45 dimensions
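        • a sketch of the target-side SVD with sklearn (45 components per the write-up; `y_train` and `y_reduced_pred` are my assumed names):

          ```python
          import numpy as np
          from sklearn.decomposition import TruncatedSVD

          svd = TruncatedSVD(n_components=45, random_state=0)
          y_reduced = svd.fit_transform(y_train)   # (n_rows, 18211) -> (n_rows, 45)

          # ... train any regressor on (X_train, y_reduced), predict y_reduced_pred ...

          y_pred = svd.inverse_transform(y_reduced_pred)  # back to all 18,211 genes
          ```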
    • feature engineering for neural net model
      • only leave-one-out encoding for cell_type and sm_name (see the sketch after this list)
        • other features decreased CV
      • also did singular value decomposition (SVD) on the target variables
      • they tried converting the SMILES into an image using the rdkit.Chem library
        • then training a ResNet from scratch (no pretraining) on it
        • but this didn’t help
      • ResNet18 was in this model!
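      • a manual sketch of leave-one-out encoding (for each row, the mean of the targets over all other rows sharing the same category value; a default aligned RangeIndex is assumed):

        ```python
        import numpy as np
        import pandas as pd

        def leave_one_out_encode(codes: pd.Series, y: np.ndarray) -> np.ndarray:
            """codes: category column (e.g. cell_type); y: (n_rows, n_targets)."""
            y = np.asarray(y, dtype=float)
            ydf = pd.DataFrame(y)
            sums = ydf.groupby(codes).transform("sum").to_numpy()    # per-group target sums
            counts = codes.map(codes.value_counts()).to_numpy()[:, None]
            return (sums - y) / np.maximum(counts - 1, 1)            # exclude the row itself
        ```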
    • 5% of their ensemble weight went to the 0720 open-source solution
      • it uses an “autoencoder method”: https://www.kaggle.com/code/vendekagonlabs/jax-autoencoder-quickstart
        • basically, they train an autoencoder on the x-values
          • but these x-values are only one-hot-encoded (i.e. they aren’t very good features)
            • which is why they only assigned 5% of their ensemble to this
        • I should check to see if they are using the variational autoencoder, or just the standard one in the notebook
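      • for reference, a minimal (non-variational) autoencoder sketch in PyTorch, standing in for the notebook’s JAX version (dimensions are illustrative):

        ```python
        import torch
        import torch.nn as nn

        class AutoEncoder(nn.Module):
            def __init__(self, in_dim: int, latent_dim: int = 64):
                super().__init__()
                self.encoder = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                             nn.Linear(256, latent_dim))
                self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                             nn.Linear(256, in_dim))

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return self.decoder(self.encoder(x))

        # trained with a reconstruction loss on the one-hot x-values, e.g.
        # loss = nn.functional.mse_loss(model(x), x)
        ```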
    • Normalization of target data DID NOT WORK

Takeaways