
The Ultimate Cheat Sheet: Picking the Right Model, Optimizer & LR for Every Scenario
Anthony Sandesh

Across supervised-learning, unsupervised-learning, time-series, deep-learning, and reinforcement-learning tasks, each modeling problem has its own "sweet spot" of algorithms, solvers/optimizers, and hyperparameter defaults. Below is a practical guide to choosing models, optimizers (or solvers), and learning-rate heuristics, and to knowing when to reach for each technique.
1. Regression problems
| Scenario | Models | Solver / Optimizer | Learning Rate & Tips |
| --- | --- | --- | --- |
| Simple, low-dimensional data | Linear Regression | Closed-form (normal equation) | no LR; just scale features |
| Multicollinear features | Ridge, Lasso | Coordinate descent | regularization strength α ≈ 1e-3–1; pick via cross-validation |
| Sparse → feature selection | Lasso | Coordinate descent | increase α to induce sparsity; monitor the number of nonzero coefficients |
| Nonlinear but interpretable | Decision Trees | Greedy splitting | max_depth ≈ 3–10; min_samples_leaf ≥ 5 |
| Better nonlinear fit, less overfitting | Random Forest, GBM | Tree-based (no LR for RF) | n_estimators 100–500; learning_rate (GBM) 0.01–0.1 |
| State-of-the-art boosting | XGBoost, LightGBM | Histogram-based gradient boosting | LR ≈ 0.01 with early stopping; max_depth 4–8; subsample 0.5 |
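To make the Ridge/Lasso row concrete, here is a minimal sketch of picking the regularization strength by cross-validation with scikit-learn (the synthetic dataset and the alpha grid are illustrative):

```python
# Sketch: choosing the Lasso regularization strength alpha by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 20 features, only 5 truly informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Scale features, then search alphas on a log grid with 5-fold CV
model = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 0, 30), cv=5),
)
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("best alpha:", lasso.alpha_)
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

Watching the nonzero-coefficient count as α grows is exactly the "monitor sparsity" tip from the table.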
2. Classification problems
| Scenario | Models | Solver / Optimizer | Learning Rate & Tips |
| --- | --- | --- | --- |
| Binary, linearly separable | Logistic Regression | LBFGS / liblinear | C=1.0; scale inputs; try both L1 & L2 penalties |
| Small data, non-parametric | K-Nearest Neighbors (KNN) | — | k ≈ √n_samples; standardize features before computing distances |
| Margin-based, high-dimensional | Support Vector Machines (SVM) | SMO | C=1; kernel='rbf'; γ = 1/n_features |
| Generative, probabilistic | Gaussian Mixture Models (GMM) | EM | n_components via elbow plot; covariance: full vs. diag |
| Few samples, pretrained features | SVM on Word2Vec/GloVe embeddings | SMO | tune C; embeddings via gensim |
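The SVM row maps almost directly onto scikit-learn. A hedged sketch (the dataset is illustrative, and `gamma="scale"` stands in for the 1/n_features heuristic):

```python
# Sketch: RBF-kernel SVM with the defaults suggested above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale first: RBF distances are meaningless on unscaled features
clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf", gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```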
3. Unsupervised learning & dimensionality reduction
| Task | Models & Techniques | Solver / Optimizer | Tips |
| --- | --- | --- | --- |
| Dimensionality reduction | PCA | SVD | n_components to retain ≈ 95% explained variance; whiten=False |
| Density clustering | DBSCAN | Ball tree / k-d tree | eps ≈ 0.5 × avg. neighbor distance; min_samples ≈ 5; handles arbitrary cluster shapes |
| Mixture modeling | Gaussian Mixture (GMM) | EM | use BIC/AIC to choose the number of components |
| Anomaly detection | Isolation Forest, One-Class SVM | Tree-based / SMO | contamination 0.01–0.1; subsample size ≈ 256 |
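For the PCA row, scikit-learn lets you ask for an explained-variance target directly instead of a fixed component count. A quick sketch (the digits dataset is just a stand-in):

```python
# Sketch: PCA keeping enough components for ~95% explained variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# A float in (0, 1) tells PCA to keep the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95, whiten=False)
X_reduced = pca.fit_transform(X)
print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```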
4. Time-series modelling
| Scenario | Models | Solver / Optimizer | Tips |
| --- | --- | --- | --- |
| Stationary, short history | ARIMA, SARIMA | Maximum likelihood | p, d, q via ACF/PACF; seasonal P, D, Q with season length m |
| Trend and seasonality | Exponential Smoothing | Holt-Winters | α, β, γ via grid search; test additive vs. multiplicative |
| Nonlinear, external regressors | Gradient Boosting (GBM) | Tree-based | add lagged features; learning_rate=0.05; n_estimators=200 |
| Deep sequence modeling | LSTM, GRU | Adam | lr=1e-3; batch_size=32; clip gradients at 1.0; tune sequence length |
| Distance-based similarity | Dynamic Time Warping (DTW) | — | warping window ≈ 10% of series length; use as the distance metric for 1-NN |
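The "add lagged features" tip deserves a concrete sketch: turn the univariate series into a supervised table of lags, then fit the GBM settings from the row above (the synthetic sine series and lag count are illustrative):

```python
# Sketch: lagged features + gradient boosting for a univariate series.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(300)

# Predict y[t] from the previous 24 observations y[t-24..t-1]
n_lags = 24
X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
y = series[n_lags:]

# Chronological split -- never shuffle time series
split = int(0.8 * len(y))
model = GradientBoostingRegressor(learning_rate=0.05, n_estimators=200)
model.fit(X[:split], y[:split])
print("test R^2:", model.score(X[split:], y[split:]))
```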
5. Deep-learning architectures
| Goal | Models | Optimizer | LR & Scheduling Tips |
| --- | --- | --- | --- |
| Tabular / MLP | Feedforward MLP (multilayer perceptron) | Adam / SGD | lr=1e-3 (Adam) or 1e-2 (SGD); weight_decay=1e-4; step LR decay |
| Image tasks | CNN (ResNet, custom conv stacks) | AdamW / SGD | lr=1e-3 (AdamW) or 0.1 (SGD with momentum); cosine annealing |
| Sequence-to-sequence / NLP | RNN / LSTM / GRU with attention | Adam | lr=5e-4; warmup steps, then linear decay |
| Pretrained transformer fine-tuning | BERT / GPT / T5 | AdamW | lr = 2e-5–5e-5; linear warmup over the first ~10% of steps |
| Representation learning | Autoencoder, VAE | Adam | lr=1e-3; VAE β=1 by default; increase β to encourage disentanglement |
| Generative modeling | GAN, DCGAN | Adam | lr_D = lr_G = 2e-4; β1=0.5, β2=0.999; optionally train D for more steps per G step |
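The warmup-then-decay schedules in this table are easy to write down framework-free. A sketch (the function name and defaults are my own, not a library API):

```python
# Sketch: linear warmup followed by cosine decay, a common fine-tuning schedule.
import math

def fine_tune_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.1):
    """Ramp linearly to base_lr over the first warmup_frac of training,
    then decay to zero along a cosine curve."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(fine_tune_lr(0, total))    # tiny, still ramping up
print(fine_tune_lr(99, total))   # ~= base_lr at the end of warmup
print(fine_tune_lr(999, total))  # ~= 0 at the end of training
```

Frameworks ship equivalents (e.g. cosine-annealing schedulers), but writing it out once makes the "warmup → decay" table entries concrete.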
6. Reinforcement learning
| Setting | Models | Optimizer | LR & Stability Tips |
| --- | --- | --- | --- |
| Value-based, discrete actions | DQN | Adam | lr=1e-4; replay buffer ≈ 1e6; update target network every ~1000 steps |
| Policy gradient, continuous | Policy Gradient / Actor-Critic | Adam | lr=3e-4; entropy coefficient 0.01; normalize rewards |
| Model-based RL | World model + MPC | Adam | lr=1e-3; tune planning horizon |
| On-policy, stochastic policy | PPO, A2C | Adam | lr=2.5e-4; clip ratio 0.2; n_steps=2048 |
| Off-policy, continuous actions | DDPG, SAC | Adam | lr=3e-4; soft-update τ=0.005; entropy temperature α ≈ 0.2 |
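The τ=0.005 in the DDPG/SAC row refers to Polyak ("soft") target-network updates. A tiny sketch, with NumPy arrays standing in for network weights:

```python
# Sketch: soft (Polyak) target update, target <- (1-tau)*target + tau*online.
import numpy as np

def soft_update(target, online, tau=0.005):
    """Move each target weight a small step toward the online weight."""
    return [(1 - tau) * t + tau * o for t, o in zip(target, online)]

online = [np.ones((4, 4)), np.ones(4)]   # stand-in "online network" weights
target = [np.zeros((4, 4)), np.zeros(4)]  # stand-in "target network" weights

for _ in range(1000):
    target = soft_update(target, online)

# After many small steps the target closely tracks the online network
print(float(target[0].mean()))
```

The small τ is what keeps the bootstrapped targets slow-moving and the training stable.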
7. Bringing it all together: model selection
Regardless of domain, the first step is to benchmark a handful of diverse models and pick the one that “just works” before diving deep into tuning. Here’s a minimal Python snippet to automate that process using scikit-learn:
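A sketch of such a loop (the dataset and candidate models below are illustrative placeholders; swap in your own):

```python
# Sketch: benchmark several candidate models with a shared preprocessing pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in candidates.items():
    # Pipeline fits the scaler inside each CV fold -- no data leakage
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```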
How it works:
- Pipeline ensures consistent preprocessing (e.g. scaling).
- cross_val_score gives an unbiased estimate of performance.
- Inspect mean ± std accuracy to compare stability.
Once you’ve identified the leading candidate, dive into grid/random search for fine-tuning hyperparameters (e.g. learning rate, regularization strength, tree depth or number of layers).
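For example, a minimal grid search over the GBM knobs mentioned earlier (the parameter grid and dataset are illustrative):

```python
# Sketch: grid search over learning rate and tree depth for gradient boosting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}
search = GridSearchCV(GradientBoostingClassifier(n_estimators=100),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

For larger grids, `RandomizedSearchCV` with a sampling budget usually finds a comparable optimum far faster.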
Key takeaways
- Start with simple, interpretable models.
- Match model complexity to data size and noise.
- Use well-known default optimizers/solvers (Adam for DL, coordinate descent for L1/L2).
- Always benchmark multiple approaches before heavy tuning.
- Automate model comparison with cross-validation pipelines—then optimize the winner.
Happy modeling!


