How to do Hyperparameter Tuning for Deep Learning
Hyperparameter tuning for deep learning is different from classical machine learning. In machine learning we have k-fold cross-validation and multiple other methods for performing extensive tuning. But applying all of these methods to deep learning is quite computationally intensive.
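As a quick refresher, the k-fold idea mentioned above can be sketched in a few lines of plain Python. This is a toy index-splitter for illustration only; in practice you would use sklearn.model_selection.KFold:

```python
# Toy sketch of k-fold cross-validation index generation.
# In practice, use sklearn.model_selection.KFold instead.

def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k roughly equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(10, 5)
# Each fold serves once as the validation set while the rest train,
# so the model is retrained k times from scratch.
for i, val_fold in enumerate(folds):
    train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
```

Each of the k rounds retrains the model from scratch, which is exactly why this approach is rarely practical for deep learning.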
Training a single neural network is already a big effort. Now we have to run an entire tuning procedure on top of it? That is an expensive endeavour in itself.
However, it is still good practice to perform hyperparameter tuning for deep learning. Most tuning is done on the learning rate itself, because finding the best learning rate improves the training process and reduces wasted computation from unnecessary training loops. Beyond that, we can also tune the batch size, dropout rate, and even the network architecture itself to ensure we have the best possible network for training.
Below is a tutorial on the hyperparameter tuning we can do for deep learning.
The first step: the hold-out method
In the hold-out method, we split our data into three parts. The first part is for training (70–80%), the second part is for validation (10–15%), and the last one is for testing (the remaining 10–15%). We use the validation set to evaluate the performance of the hyperparameters chosen for the training loop, and we repeat this process for several candidate hyperparameters for comparison. Finally, we use the test set to evaluate the overall performance of the network.
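Before moving to MNIST, here is a minimal, framework-free sketch of that three-way split. The 70/15/15 percentages are a common convention rather than a fixed rule, and `hold_out_split` is a name made up for this illustration:

```python
import random

def hold_out_split(indices, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle indices, then split into train / validation / test parts."""
    indices = list(indices)
    random.Random(seed).shuffle(indices)  # deterministic shuffle for reproducibility
    n_train = int(len(indices) * train_frac)
    n_val = int(len(indices) * val_frac)
    train_idx = indices[:n_train]
    val_idx = indices[n_train:n_train + n_val]
    test_idx = indices[n_train + n_val:]  # the remaining ~15%
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = hold_out_split(range(1000))
```

With 1,000 samples this yields 700 training, 150 validation, and 150 test indices, with no overlap between the three parts.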
Let’s demonstrate the method below with MNIST and a simple neural network. We will use PyTorch as our library.
# first, let's load the libraries
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split, ConcatDataset
from torchvision.datasets import MNIST
from torchvision import transforms
from sklearn.metrics import accuracy_score
from skopt import gp_minimize
from skopt.space import Integer, Real, Categorical
from skopt.utils import use_named_args
import numpy as np

Next, let’s set up our device and load our data:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Now let’s prepare the transformation for our data. Since we’re using MNIST, this is the typical normalization for MNIST:
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,)) # Mean and std dev of MNIST
])

Now, MNIST already comes split into training and test sets. So we need to split the training data further and use part of it as our validation set. Let’s load and split it as below:
full_train_dataset = MNIST(root="./data", train=True, download=True, transform=transform)
test_dataset = MNIST(root="./data", train=False, download=True, transform=transform)
# let's get the validation dataset with a random split
train_size = 50000
val_size = len(full_train_dataset) - train_size
train_dataset, val_dataset = random_split(full_train_dataset, [train_size, val_size])

Now, it’s time to load the data:
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=1024, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=1024, shuffle=False)
print(f"Data Split:")
print(f" - Training set size: {len(train_dataset)}")
print(f" - Validation set size: {len(val_dataset)}")
print(f" - Test set size: {len(test_dataset)}")
print("-" * 30)

If the data is loaded properly, we will see the three split sizes printed: 50,000 for training, 10,000 for validation, and 10,000 for testing.
Now, let us establish the neural net architecture. This will be a simple one, with two linear layers separated by an activation function. We also introduce a dropout layer, whose rate is one of the hyperparameters we will need to tune.
class SimpleNN(nn.Module):
    def __init__(self, hidden_size, activation_fn, dropout_rate):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, hidden_size)
        self.activation = activation_fn
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden_size, 10)  # 10 classes for MNIST

    def forward(self, x):
        x = self.flatten(x)
        x = self.activation(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

Now we can proceed with the training method. We will use the validation set to validate our training progress. The validation accuracy can be used to find the best learning rate, determine the stopping criteria, and perhaps choose the batch size. But this must be done many times, through multiple training loops, which is not very efficient if you think about it.
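That brute-force idea looks something like the sketch below. Here `train_and_validate` is a hypothetical stand-in for a full PyTorch training-plus-validation run like the one in this article:

```python
# Naive hold-out search over learning rates: one full training run per
# candidate. `train_and_validate` is a made-up placeholder that scores
# each learning rate; in reality it would train the network for several
# epochs and return the validation accuracy.

def train_and_validate(lr):
    # Placeholder: pretend validation accuracy peaks near lr = 1e-3.
    return 1.0 - abs(lr - 1e-3) * 100

candidate_lrs = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
results = {lr: train_and_validate(lr) for lr in candidate_lrs}
best_lr = max(results, key=results.get)
# Five candidates means five full training runs, and the search learns
# nothing from earlier results.
```

Every candidate costs one complete training run, and each run tells the next one nothing, which is exactly the inefficiency we want to avoid.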
Therefore, there is a better way to perform this tuning. For that, let’s move on to the next step, which is:
Bayesian Optimization
Bayesian Optimization works from the principle of Bayes’ theorem: the principle of updating beliefs. A simple way to understand this is to think of ourselves as humans who hold our own beliefs.
For example, let’s say our belief is “pizza is better with pineapple.” But over the years, with repeated observations, we update that belief, until we finally say “pizza with pineapple is a travesty.” So, we update our beliefs based on observations.
Now, let’s frame this in a more probabilistic way. Instead of relying only on our own observations, we run a survey. Maybe in the past, 7 out of 10 people within our circle of friends said that pizza with pineapple is great. But over time, as our circle changes, the people around us might say it is horrible. So we update our probabilities accordingly.
Similarly, in Bayesian Optimization, we start with a prior belief about which hyperparameters are likely to perform well. As we evaluate different hyperparameter combinations and observe their performance (e.g., validation loss), we update our beliefs using Bayes’ theorem. This helps guide the search toward more promising regions of the hyperparameter space.
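A toy numeric version of that belief update, with made-up probabilities, looks like this:

```python
# Bayes' theorem: P(H | E) = P(E | H) * P(H) / P(E)
# H = "most people like pineapple pizza", E = hearing "it's horrible".
# All probabilities below are invented for illustration.

prior = 0.7           # initial belief that H is true (7/10 friends agreed)
p_e_given_h = 0.2     # chance of hearing "horrible" if H is true
p_e_given_not_h = 0.9 # chance of hearing "horrible" if H is false

# Total probability of observing E, then the Bayes update:
p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
posterior = p_e_given_h * prior / p_e
# The unfavorable observation drops the belief from 0.70 to about 0.34.
```

Bayesian Optimization performs the same kind of update, but over a surrogate model of “which hyperparameters give low validation error” instead of a single yes/no belief.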
Now that we have understood a little bit of the theory, let us put it into practice. The first step is to establish the search space for our hyperparameter search. The hyperparameters that we’re focusing on are the hidden size, the activation function, the dropout rate, and the learning rate. We can establish it as below:
search_space = [
    Integer(32, 256, name='hidden_size'),
    Real(1e-4, 1e-2, "log-uniform", name='learning_rate'),
    Real(0.1, 0.5, name='dropout_rate'),
    Categorical([nn.ReLU(), nn.Tanh()], name='activation_fn')
]

Alright, now let’s establish our objective function and set up our training loop.
Because the optimizer minimizes its objective while we want to maximize accuracy, the objective is set as 1 - accuracy.
@use_named_args(search_space)
def objective(**params):
    """
    Trains a PyTorch model with given hyperparameters, evaluates on the
    validation set, and returns the validation error (1 - accuracy).
    """
    model_params = {
        'hidden_size': params['hidden_size'],
        'activation_fn': params['activation_fn'],
        'dropout_rate': params['dropout_rate']
    }
    model = SimpleNN(**model_params).to(device)
    optimizer = optim.Adam(model.parameters(), lr=params['learning_rate'])
    criterion = nn.CrossEntropyLoss()

    # training loop
    model.train()
    # We'll train for a fixed number of epochs for each hyperparameter set
    for epoch in range(5):  # a small number of epochs for faster tuning
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

    # validation loop
    model.eval()
    all_preds = []
    all_targets = []
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            preds = torch.argmax(output, dim=1)
            all_preds.extend(preds.cpu().numpy())
            all_targets.extend(target.cpu().numpy())
    accuracy = accuracy_score(all_targets, all_preds)

    # the Bayesian objective function is set to minimize, so we return the error
    print(f"Params: {params}, Val Accuracy: {accuracy:.4f}, Error: {1.0 - accuracy:.4f}")
    return 1.0 - accuracy

Then, with the objective function and the training loop set, let us begin the Bayesian Optimization. We will set the number of calls to 15, meaning we will evaluate 15 different hyperparameter combinations.
# now, let's run the Bayesian optimization
print("Running Bayesian Optimization...")
result = gp_minimize(
func=objective,
dimensions=search_space,
n_calls=15, # Number of different hyperparameter combinations to try
random_state=42,
n_random_starts=5
)
print("Bayesian Optimization finished.")
print("-" * 30)

Once complete, we can retrieve the best hyperparameters as below:
#let's evaluate the overall results
best_params_list = result.x
best_params = {dim.name: val for dim, val in zip(search_space, best_params_list)}
print(f"Best validation accuracy: {1.0 - result.fun:.4f}")
print("Best hyperparameters found:")
for key, value in best_params.items():
print(f" - {key}: {value}")
print("-" * 30)

If our code runs correctly, it will print the best validation accuracy and the best hyperparameters found.
And now, with the best hyperparameters found, we can use them to finish our training. We combine our training and validation datasets back together.
final_train_dataset = ConcatDataset([train_dataset, val_dataset])
final_train_loader = DataLoader(final_train_dataset, batch_size=64, shuffle=True)

And run our final training loop with the best hyperparameters:
final_model_params = {
    'hidden_size': best_params['hidden_size'],
    'activation_fn': best_params['activation_fn'],
    'dropout_rate': best_params['dropout_rate']
}

# Create and train the final model with the best hyperparameters
final_model = SimpleNN(**final_model_params).to(device)
optimizer = optim.Adam(final_model.parameters(), lr=best_params['learning_rate'])
criterion = nn.CrossEntropyLoss()

# Train for a few more epochs on the full data
for epoch in range(10):
    final_model.train()
    for data, target in final_train_loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = final_model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
    print(f"Final training, Epoch {epoch+1}/10 completed.")

# Evaluate the final model on the unseen test set
final_model.eval()
all_preds = []
all_targets = []
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        output = final_model(data)
        preds = torch.argmax(output, dim=1)
        all_preds.extend(preds.cpu().numpy())
        all_targets.extend(target.cpu().numpy())
test_accuracy = accuracy_score(all_targets, all_preds)

print(f"\nFinal Model Performance on the Unseen Test Set:")
print(f"Accuracy: {test_accuracy:.4f}")
print("\nThis final accuracy is our estimate of how the model will perform on new, real-world data.")

And with that, we have our best model.
Alright, that’s all for today. A very short one. Let’s meet again next time.