Hyperparameter Tuning for Deep Learning Models with Ray Tune — A Simple PyTorch Example
Once we finish designing our deep models, the next thing to decide is the hyperparameters. There are many packages available for hyperparameter tuning, which are summarized in this fantastic article. But today I’m not going to talk about all of them; instead, I’ll show a demo of one specific package, Ray, which I found very handy for wrapping my existing code.
The Ray package: https://docs.ray.io/en/latest/index.html
For installation, you can simply follow the instructions here, which are very straightforward: https://docs.ray.io/en/latest/installation.html
I assume that when you read this article, you already have a deep model written and are just looking for a convenient way to tune its hyperparameters. That’s why I’m using the following PyTorch example of MNIST classification to show how to plug Ray into existing code.
Note: this example is adapted from the Ray tutorial. I converted it to a Jupyter notebook so that you can easily run it in the cloud and see the power of Ray. I’ll also show you how to set up tune.run so that you can modify the configuration parameters to fit your own needs.
Let’s begin.
The code is pretty straightforward. We import all necessary modules first.
# Original code here:
# https://github.com/pytorch/examples/blob/master/mnist/main.py
import os
import argparse
from filelock import FileLock

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

!pip install ray
import ray
from ray import tune
from ray.tune.schedulers import AsyncHyperBandScheduler
Then we create our simple CNN model:
# Change these values if you want the training to run quicker or slower.
EPOCH_SIZE = 512
TEST_SIZE = 256

# Define the network with 1 convolutional layer + 1 FC layer.
class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 3, kernel_size=3)
        self.fc = nn.Linear(192, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 3))
        x = x.view(-1, 192)
        x = self.fc(x)
        return F.log_softmax(x, dim=1)
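If you are wondering where the 192 in the fully connected layer comes from, a quick shape check (not part of the original tutorial) makes it clear: a 28x28 MNIST image goes through the 3x3 convolution (28 -> 26) and the 3x3 max pooling (26 -> 8), and with 3 channels that gives 3 * 8 * 8 = 192 features.

# Quick sanity check of the ConvNet output shape (assumes the class above).
dummy = torch.zeros(1, 1, 28, 28)   # one fake MNIST image
net = ConvNet()
print(net(dummy).shape)             # expected: torch.Size([1, 10])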
Define the corresponding training/testing/data-loading functions. Note that we haven’t used anything from Ray yet.
def train(model, optimizer, train_loader, device=None):
    device = device or torch.device("cpu")
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if batch_idx * len(data) > EPOCH_SIZE:
            return
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()


def test(model, data_loader, device=None):
    device = device or torch.device("cpu")
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (data, target) in enumerate(data_loader):
            if batch_idx * len(data) > TEST_SIZE:
                break
            data, target = data.to(device), target.to(device)
            outputs = model(data)
            _, predicted = torch.max(outputs.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()
    return correct / total


def get_data_loaders():
    mnist_transforms = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.1307, ), (0.3081, ))])

    # We add FileLock here because multiple workers will want to
    # download data, and this may cause overwrites since
    # DataLoader is not threadsafe.
    with FileLock(os.path.expanduser("~/data.lock")):
        train_loader = torch.utils.data.DataLoader(
            datasets.MNIST(
                "~/data", train=True, download=True,
                transform=mnist_transforms),
            batch_size=64,
            shuffle=True)
        test_loader = torch.utils.data.DataLoader(
            datasets.MNIST("~/data", train=False, transform=mnist_transforms),
            batch_size=64,
            shuffle=True)
    return train_loader, test_loader
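If you want to sanity-check the loaders before bringing Ray in, grabbing a single batch (not part of the original tutorial) is enough to confirm the shapes the ConvNet expects:

# Pull one batch from the training loader and inspect its shape.
train_loader, test_loader = get_data_loaders()
images, labels = next(iter(train_loader))
print(images.shape)   # expected: torch.Size([64, 1, 28, 28])
print(labels.shape)   # expected: torch.Size([64])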
Now we put the train and test functions together and add tune.report at the end.
def train_mnist(config):
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    train_loader, test_loader = get_data_loaders()
    model = ConvNet().to(device)
    optimizer = optim.SGD(
        model.parameters(), lr=config["lr"], momentum=config["momentum"])
    while True:
        train(model, optimizer, train_loader, device)
        acc = test(model, test_loader, device)
        # Set this to run Tune. If you want to evaluate a loss instead,
        # report mean_loss here and set the mode in tune.run to "min".
        tune.report(mean_accuracy=acc)
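Before handing train_mnist over to Tune, it helps to remember that config is just a plain Python dict. A minimal sketch for debugging outside of Tune, with hypothetical fixed values and the infinite loop replaced by a single pass (so tune.report is not needed):

# Run one manual training/evaluation pass with fixed hyperparameters.
config = {"lr": 0.01, "momentum": 0.5}
train_loader, test_loader = get_data_loaders()
model = ConvNet()
optimizer = optim.SGD(model.parameters(), lr=config["lr"], momentum=config["momentum"])
train(model, optimizer, train_loader)
print("accuracy:", test(model, test_loader))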
Now comes the key part: the Ray Tune configuration. There are a few things to note:
1. You can swap HyperOptSearch for another search algorithm, depending on the optimization method you want to use (see the sketch after this list).
2. In tune.run, it’s important to specify the local_dir parameter; otherwise Tune writes its results under ~/ray_results in your home directory by default.
3. The mode has to match the metric. With metric="mean_accuracy" the mode should be "max"; if you choose metric="mean_loss", the mode has to be set to "min".
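For example, switching to Bayesian optimization only changes the search-algorithm lines below. A minimal sketch, assuming the bayesian-optimization package is installed (this replaces HyperOptSearch in the code that follows):

# Sketch: swap HyperOptSearch for BayesOptSearch (requires bayesian-optimization).
from ray.tune.suggest.bayesopt import BayesOptSearch
from ray.tune.suggest import ConcurrencyLimiter

algo = BayesOptSearch()                            # picks up the search space from tune.run's config
algo = ConcurrencyLimiter(algo, max_concurrent=4)  # still limit concurrent trials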
from ray.tune.suggest.hyperopt import HyperOptSearch
from ray.tune.suggest import ConcurrencyLimiter

# Assign the total number of CPUs and GPUs. Make sure you call ray.init at
# the beginning and ray.shutdown at the end.
ray.init(num_cpus=8, num_gpus=4)

sched = AsyncHyperBandScheduler()  # set a scheduler
# If you want to use Bayesian optimization, import BayesOptSearch instead.
algo = HyperOptSearch()
algo = ConcurrencyLimiter(algo, max_concurrent=4)  # run at most 4 trials concurrently
# See the list of all search algorithms here:
# https://docs.ray.io/en/latest/tune/api_docs/suggestion.html
analysis = tune.run(
    train_mnist,             # the core training/testing of your model
    local_dir=os.getcwd(),   # for saving the log files
    name="exp",              # name for the result directory
    metric="mean_accuracy",
    mode="max",
    search_alg=algo,
    scheduler=sched,
    stop={
        "mean_accuracy": 0.98,
        "training_iteration": 100
    },
    resources_per_trial={
        "cpu": 2,
        "gpu": 1
    },
    num_samples=50,  # 50 trials
    config={
        "lr": tune.loguniform(1e-4, 1e-2),
        "momentum": tune.uniform(0.1, 0.9),
    })

print("Best config is:", analysis.best_config)
ray.shutdown()
Now we can see the training process. Below is a snapshot at the beginning of the training stage; we can see 4 trials running in parallel (we requested 4 GPUs and each trial is limited to 1 GPU).
After running 50 trials:
And you’ll see all the log files under your local_dir/exp folder.
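If you want to dig into the trials programmatically rather than browsing the log folder, the analysis object returned by tune.run can be turned into a dataframe. A minimal sketch, assuming the metric names reported above (the exact column layout can vary slightly across Ray versions):

# One row per trial, with the last reported result for each.
df = analysis.dataframe()
df = df.sort_values("mean_accuracy", ascending=False)
print(df[["mean_accuracy", "training_iteration"]].head())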
Hope this is useful for you!