
The Ultimate Cheat Sheet: Picking the Right Model, Optimizer & LR for Every Scenario
Anthony Sandesh

Across supervised-learning, unsupervised-learning, time-series, deep-learning, and reinforcement-learning tasks, each modeling problem has its own "sweet spot" of algorithms, solvers/optimizers, and hyperparameter defaults. Below is a practical guide to choosing models, optimizers (or solvers), and learning-rate heuristics, and to knowing when to reach for each technique.
1. Regression problems
| Scenario | Models | Solver / Optimizer | Learning Rate & Tips |
| --- | --- | --- | --- |
| Simple, low-dimensional data | Linear Regression | Closed-form (normal equation) | no LR; just scale features |
| Multicollinear features | Ridge, Lasso | Coordinate descent | regularization strength α ≈ 1e-3–1; pick via cross-validation |
| Sparse → feature selection | Lasso | Coordinate descent | increase α to induce sparsity; monitor the number of nonzero coefficients |
| Nonlinear but interpretable | Decision Trees | Greedy splitting | max_depth ≈ 3–10; min_samples_leaf ≥ 5 |
| Better nonlinear fit, less overfitting | Random Forest, GBM | Tree-based (no LR for RF) | n_estimators 100–500; learning_rate (GBM) 0.01–0.1 |
| State-of-the-art boosting | XGBoost, LightGBM | Histogram-based gradient boosting | LR ≈ 0.01 with early stopping; max_depth 4–8; subsample 0.5 |
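To make the Ridge/Lasso row concrete, here is a minimal sketch of picking the regularization strength by cross-validation with scikit-learn (the synthetic dataset and the alpha grid are illustrative):

```python
# Sketch: choosing the Lasso regularization strength alpha by cross-validation.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 100 samples, 20 features, only 5 truly informative
X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# Scale features, then search alphas on a log grid with 5-fold CV
model = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(-3, 0, 30), cv=5),
)
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("best alpha:", lasso.alpha_)
print("nonzero coefficients:", int(np.sum(lasso.coef_ != 0)))
```

Watching the nonzero-coefficient count as α grows is exactly the "monitor sparsity" tip from the table.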
2. Classification problems
| Scenario | Models | Solver / Optimizer | Learning Rate & Tips |
| --- | --- | --- | --- |
| Binary, linearly separable | Logistic Regression | LBFGS / liblinear | C=1.0; scale inputs; try both L1 & L2 penalties |
| Small data, non-parametric | K-Nearest Neighbors (KNN) | — | k ≈ √n_samples; standardize features before computing distances |
| Margin-based, high-dimensional | Support Vector Machines (SVM) | SMO | C=1; kernel='rbf'; γ = 1/n_features |
| Generative, probabilistic | Gaussian Mixture Models (GMM) | EM | n_components via elbow plot; covariance: full vs. diag |
| Few samples, pretrained features | SVM on Word2Vec/GloVe embeddings | SMO | tune C; embeddings via gensim |
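The SVM row maps almost directly onto scikit-learn. A hedged sketch (the dataset is illustrative, and `gamma="scale"` stands in for the 1/n_features heuristic):

```python
# Sketch: RBF-kernel SVM with the defaults suggested above.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale first: RBF distances are meaningless on unscaled features
clf = make_pipeline(StandardScaler(), SVC(C=1.0, kernel="rbf", gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```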
3. Unsupervised learning & dimensionality reduction
| Task | Models & Techniques | Solver / Optimizer | Tips |
| --- | --- | --- | --- |
| Dimensionality reduction | PCA | SVD | n_components to retain ≈ 95% explained variance; whiten=False |
| Density clustering | DBSCAN | Ball tree / k-d tree | eps ≈ 0.5 × avg. neighbor distance; min_samples ≈ 5; handles arbitrary cluster shapes |
| Mixture modeling | Gaussian Mixture (GMM) | EM | use BIC/AIC to choose the number of components |
| Anomaly detection | Isolation Forest, One-Class SVM | Tree-based / SMO | contamination 0.01–0.1; subsample size ≈ 256 |
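For the PCA row, scikit-learn lets you ask for an explained-variance target directly instead of a fixed component count. A quick sketch (the digits dataset is just a stand-in):

```python
# Sketch: PCA keeping enough components for ~95% explained variance.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# A float in (0, 1) tells PCA to keep the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95, whiten=False)
X_reduced = pca.fit_transform(X)
print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```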
4. Time-series modelling
| Scenario | Models | Solver / Optimizer | Tips |
| --- | --- | --- | --- |
| Stationary, short history | ARIMA, SARIMA | Maximum likelihood | p, d, q via ACF/PACF; seasonal P, D, Q with season length m |
| Trend and seasonality | Exponential Smoothing | Holt-Winters | α, β, γ via grid search; test additive vs. multiplicative |
| Nonlinear, external regressors | Gradient Boosting (GBM) | Tree-based | add lagged features; learning_rate=0.05; n_estimators=200 |
| Deep sequence modeling | LSTM, GRU | Adam | lr=1e-3; batch_size=32; clip gradients at 1.0; tune sequence length |
| Distance-based similarity | Dynamic Time Warping (DTW) | — | warping window ≈ 10% of series length; use as the distance metric for 1-NN |
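The "add lagged features" tip deserves a concrete sketch: turn the univariate series into a supervised table of lags, then fit the GBM settings from the row above (the synthetic sine series and lag count are illustrative):

```python
# Sketch: lagged features + gradient boosting for a univariate series.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(300)

# Predict y[t] from the previous 24 observations y[t-24..t-1]
n_lags = 24
X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
y = series[n_lags:]

# Chronological split -- never shuffle time series
split = int(0.8 * len(y))
model = GradientBoostingRegressor(learning_rate=0.05, n_estimators=200)
model.fit(X[:split], y[:split])
print("test R^2:", model.score(X[split:], y[split:]))
```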
5. Deep-learning architectures
| Goal | Models | Optimizer | LR & Scheduling Tips |
| --- | --- | --- | --- |
| Tabular / MLP | Feedforward MLP (multilayer perceptron) | Adam / SGD | lr=1e-3 (Adam) or 1e-2 (SGD); weight_decay=1e-4; step LR decay |
| Image tasks | CNN (ResNet, custom conv stacks) | AdamW / SGD | lr=1e-3 (AdamW) or 0.1 (SGD with momentum); cosine annealing |
| Sequence-to-sequence / NLP | RNN / LSTM / GRU with attention | Adam | lr=5e-4; warmup steps, then linear decay |
| Pretrained transformer fine-tuning | BERT / GPT / T5 | AdamW | lr = 2e-5–5e-5; linear warmup over the first ~10% of steps |
| Representation learning | Autoencoder, VAE | Adam | lr=1e-3; VAE β=1 by default; increase β to encourage disentanglement |
| Generative modeling | GAN, DCGAN | Adam | lr_D = lr_G = 2e-4; β1=0.5, β2=0.999; optionally train D for more steps per G step |
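The warmup-then-decay schedules in this table are easy to write down framework-free. A sketch (the function name and defaults are my own, not a library API):

```python
# Sketch: linear warmup followed by cosine decay, a common fine-tuning schedule.
import math

def fine_tune_lr(step, total_steps, base_lr=3e-5, warmup_frac=0.1):
    """Ramp linearly to base_lr over the first warmup_frac of training,
    then decay to zero along a cosine curve."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000
print(fine_tune_lr(0, total))    # tiny, still ramping up
print(fine_tune_lr(99, total))   # ~= base_lr at the end of warmup
print(fine_tune_lr(999, total))  # ~= 0 at the end of training
```

Frameworks ship equivalents (e.g. cosine-annealing schedulers), but writing it out once makes the "warmup → decay" table entries concrete.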
6. Reinforcement learning
| Setting | Models | Optimizer | LR & Stability Tips |
| --- | --- | --- | --- |
| Value-based, discrete actions | DQN | Adam | lr=1e-4; replay buffer ≈ 1e6; update target network every ~1000 steps |
| Policy gradient, continuous | Policy Gradient / Actor-Critic | Adam | lr=3e-4; entropy coefficient 0.01; normalize rewards |
| Model-based RL | World model + MPC | Adam | lr=1e-3; tune planning horizon |
| On-policy, stochastic policy | PPO, A2C | Adam | lr=2.5e-4; clip ratio 0.2; n_steps=2048 |
| Off-policy, continuous actions | DDPG, SAC | Adam | lr=3e-4; soft-update τ=0.005; entropy temperature α ≈ 0.2 |
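The τ=0.005 in the DDPG/SAC row refers to Polyak ("soft") target-network updates. A tiny sketch, with NumPy arrays standing in for network weights:

```python
# Sketch: soft (Polyak) target update, target <- (1-tau)*target + tau*online.
import numpy as np

def soft_update(target, online, tau=0.005):
    """Move each target weight a small step toward the online weight."""
    return [(1 - tau) * t + tau * o for t, o in zip(target, online)]

online = [np.ones((4, 4)), np.ones(4)]   # stand-in "online network" weights
target = [np.zeros((4, 4)), np.zeros(4)]  # stand-in "target network" weights

for _ in range(1000):
    target = soft_update(target, online)

# After many small steps the target closely tracks the online network
print(float(target[0].mean()))
```

The small τ is what keeps the bootstrapped targets slow-moving and the training stable.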
7. Bringing it all together: model selection
Regardless of domain, the first step is to benchmark a handful of diverse models and pick the one that “just works” before diving deep into tuning. Here’s a minimal Python snippet to automate that process using scikit-learn:
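A sketch of such a loop (the dataset and candidate models below are illustrative placeholders; swap in your own):

```python
# Sketch: benchmark several candidate models with a shared preprocessing pipeline.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=200, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in candidates.items():
    # Pipeline fits the scaler inside each CV fold -- no data leakage
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
```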
How it works:
- Pipeline ensures consistent preprocessing (e.g. scaling).
- cross_val_score gives an unbiased estimate of performance.
- Inspect mean ± std accuracy to compare stability.
Once you’ve identified the leading candidate, dive into grid/random search for fine-tuning hyperparameters (e.g. learning rate, regularization strength, tree depth or number of layers).
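For example, a minimal grid search over the GBM knobs mentioned earlier (the parameter grid and dataset are illustrative):

```python
# Sketch: grid search over learning rate and tree depth for gradient boosting.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}
search = GridSearchCV(GradientBoostingClassifier(n_estimators=100),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

For larger grids, `RandomizedSearchCV` with a sampling budget usually finds a comparable optimum far faster.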
Key takeaways
- Start with simple, interpretable models.
- Match model complexity to data size and noise.
- Use well-known default optimizers/solvers (Adam for DL, coordinate descent for L1/L2).
- Always benchmark multiple approaches before heavy tuning.
- Automate model comparison with cross-validation pipelines—then optimize the winner.
Happy modeling!


