🏖️ Kaggle Solutions

    decreasing learning rate

    Mar 01, 2025, 1 min read

    • The idea is that the model should take larger steps early in training and smaller steps later, so it can converge more precisely near a minimum.
    • Note: for the Adam optimizer, the learning rate you set is the maximum learning rate that Adam can use - “It’s true that the learning rates adapt themselves during training steps, but if you want to be sure that every update step doesn’t exceed lambda you can then lower lambda using exponential decay or whatever” - https://stackoverflow.com/questions/39517431/should-we-do-learning-rate-decay-for-adam-optimizer
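
    A minimal sketch of decaying the learning rate on top of Adam in PyTorch; the toy model, random data, and `gamma=0.95` decay factor are illustrative assumptions, not settings from these notes:

    ```python
    import torch
    import torch.nn as nn

    # Placeholder model and loss; substitute your own.
    model = nn.Linear(10, 1)
    criterion = nn.MSELoss()

    # The lr passed to Adam acts as the maximum step size;
    # ExponentialLR lowers it by a factor of gamma after each epoch.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

    for epoch in range(10):
        x = torch.randn(32, 10)   # dummy batch
        y = torch.randn(32, 1)

        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

        scheduler.step()  # decrease the learning rate once per epoch
        print(epoch, scheduler.get_last_lr())
    ```

    Other schedules (step decay, cosine annealing, warmup then decay) follow the same pattern: large steps first, smaller steps later.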

    Backlinks

    • MLB Player Digital Engagement Forecasting
