Introduction

GoogLeNet was first introduced in Google’s paper, Going deeper with convolutions. The paper proposes the Inception V1 / GoogLeNet architecture, which ranked 1st in the classification track of ILSVRC-2014 (Top-5 Error = 6.67%). The model has only about 6.8 million parameters—9× fewer than AlexNet, and 20× fewer than VGG-16—so it is much more lightweight.

In this post, I briefly introduce GoogLeNet and share the results of a small experiment where I “randomly” modified the GoogLeNet model.

GoogLeNet Architecture Overview

In many cases, it’s not obvious when we should use max-pooling versus convolution. GoogLeNet essentially uses them all at once: it applies convolutions with different kernel sizes and max-pooling in parallel, then concatenates their outputs. This structure is called an Inception module, and GoogLeNet is composed of many stacked Inception modules.

[Figure: GoogLeNet Inception module]

The diagram above illustrates an Inception module. Another key idea in GoogLeNet is the bottleneck. In the figure, the left side is the naïve Inception module, while the right side is the version that inserts 1×1 convolutions. A 1×1 convolution shrinks the number of channels before the expensive 3×3 and 5×5 convolutions, which greatly reduces both the parameter count and the computation; hence the term bottleneck.

[Figure: GoogLeNet Inception parameters]

Without the bottleneck, a 5×5 convolution that maps a 28×28×192 input to 32 output channels needs $((28\times 28\times 5\times 5)\times 192)\times 32 \approx 120\text{M}$ MACs (multiply-accumulate operations).

With a 1×1 convolution that first reduces the 192 channels to 16, the cost becomes $((28\times 28\times 1\times 1)\times 192)\times 16 \approx 2.4\text{M}$ MACs for the first layer plus $((28\times 28\times 5\times 5)\times 16)\times 32 \approx 10\text{M}$ MACs for the second layer, about $12.4\text{M}$ in total. The MAC count drops by roughly an order of magnitude, and the weight count of this branch shrinks by about the same factor (from 153,600 to 15,872, ignoring biases).
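The arithmetic above can be reproduced in a few lines of plain Python. This is only a sketch: the helper names `conv_macs` and `conv_weights` are made up for illustration, and the counts ignore biases, padding effects, and any batch-norm layers.

```python
def conv_macs(out_h, out_w, kernel, in_ch, out_ch):
    """MACs for one conv layer: each output pixel of each output channel
    needs kernel * kernel * in_ch multiply-accumulates."""
    return out_h * out_w * kernel * kernel * in_ch * out_ch


def conv_weights(kernel, in_ch, out_ch):
    """Weight count of one conv layer, ignoring biases."""
    return kernel * kernel * in_ch * out_ch


# Direct 5x5 convolution: 192 -> 32 channels on a 28x28 feature map.
naive_macs = conv_macs(28, 28, 5, 192, 32)
naive_weights = conv_weights(5, 192, 32)

# Bottleneck: 1x1 conv (192 -> 16) followed by a 5x5 conv (16 -> 32).
bottleneck_macs = conv_macs(28, 28, 1, 192, 16) + conv_macs(28, 28, 5, 16, 32)
bottleneck_weights = conv_weights(1, 192, 16) + conv_weights(5, 16, 32)

print(f"naive MACs:      {naive_macs:,}")
print(f"bottleneck MACs: {bottleneck_macs:,}")
print(f"MAC ratio:       {naive_macs / bottleneck_macs:.1f}x")
print(f"weight ratio:    {naive_weights / bottleneck_weights:.1f}x")
```

Both ratios come out the same because the 28×28 spatial factor appears in every MAC term and cancels.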

[Figure: GoogLeNet architecture]

The figure above shows the overall GoogLeNet architecture. Roughly speaking, it contains nine Inception modules.

[Figure: GoogLeNet parameter table]

For detailed configuration, refer to the table above.
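Because each Inception module concatenates its four branches along the channel axis, its output width is the sum of the 1×1, 3×3, 5×5, and pool-projection widths. The sketch below checks that the nine modules chain together consistently. The numbers are the per-module settings used in the PyTorch code later in this post; the helper names `output_channels` and `check_chain` are my own.

```python
# Per-module settings, in the order used by the PyTorch code in this post:
# (input_channels, n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj)
INCEPTION_CONFIGS = {
    "a3": (192, 64, 96, 128, 16, 32, 32),
    "b3": (256, 128, 128, 192, 32, 96, 64),
    "a4": (480, 192, 96, 208, 16, 48, 64),
    "b4": (512, 160, 112, 224, 24, 64, 64),
    "c4": (512, 128, 128, 256, 24, 64, 64),
    "d4": (512, 112, 144, 288, 32, 64, 64),
    "e4": (528, 256, 160, 320, 32, 128, 128),
    "a5": (832, 256, 160, 320, 32, 128, 128),
    "b5": (832, 384, 192, 384, 48, 128, 128),
}


def output_channels(cfg):
    """Channels after concatenating the four branches of one module."""
    _, n1x1, _, n3x3, _, n5x5, pool_proj = cfg
    return n1x1 + n3x3 + n5x5 + pool_proj


def check_chain(configs):
    """Map each module name to its output width, checking that every
    module's input width equals the previous module's output width."""
    widths = {}
    previous = None
    for name, cfg in configs.items():
        if previous is not None and cfg[0] != previous:
            raise ValueError(f"{name} expects {cfg[0]} channels, got {previous}")
        previous = output_channels(cfg)
        widths[name] = previous
    return widths


widths = check_chain(INCEPTION_CONFIGS)
for name, width in widths.items():
    print(f"{name}: {width} channels out")
```

Max-pooling between stages changes only the spatial size, not the channel count, so it can be ignored here. The last module produces 1024 channels, which is the input width of the final linear layer.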

Implementing GoogLeNet in PyTorch

The code isn’t long, so I’m pasting it in full. It’s basically the MNIST example from PyTorch, adapted to use CIFAR, and it uses the GoogLeNet module from the pytorch-cifar100 project.

If you copy the code below, it should run directly. My environment is Python 3 + PyTorch 1.6 + CUDA 10.2.

from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import transforms
from torch.optim.lr_scheduler import StepLR
import numpy as np

class Inception(nn.Module):
    def __init__(self, input_channels, n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj):
        super().__init__()

        # 1x1conv branch
        self.b1 = nn.Sequential(
            nn.Conv2d(input_channels, n1x1, kernel_size=1),
            nn.BatchNorm2d(n1x1),
            nn.ReLU(inplace=True)
        )

        # 1x1conv -> 3x3conv branch
        self.b2 = nn.Sequential(
            nn.Conv2d(input_channels, n3x3_reduce, kernel_size=1),
            nn.BatchNorm2d(n3x3_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(n3x3_reduce, n3x3, kernel_size=3, padding=1),
            nn.BatchNorm2d(n3x3),
            nn.ReLU(inplace=True)
        )

        # 1x1conv -> 5x5conv branch
        # we use 2 3x3 conv filters stacked instead
        # of 1 5x5 filters to obtain the same receptive
        # field with fewer parameters
        self.b3 = nn.Sequential(
            nn.Conv2d(input_channels, n5x5_reduce, kernel_size=1),
            nn.BatchNorm2d(n5x5_reduce),
            nn.ReLU(inplace=True),
            nn.Conv2d(n5x5_reduce, n5x5, kernel_size=3, padding=1),
            nn.BatchNorm2d(n5x5),
            nn.ReLU(inplace=True),
            nn.Conv2d(n5x5, n5x5, kernel_size=3, padding=1),
            nn.BatchNorm2d(n5x5),
            nn.ReLU(inplace=True)
        )

        # 3x3pooling -> 1x1conv
        # same conv
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(input_channels, pool_proj, kernel_size=1),
            nn.BatchNorm2d(pool_proj),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)


class GoogleNet(nn.Module):

    def __init__(self, num_class=100):
        super().__init__()
        self.prelayer = nn.Sequential(
            nn.Conv2d(3, 192, kernel_size=3, padding=1),
            nn.BatchNorm2d(192),
            nn.ReLU(inplace=True)
        )

        # although we only use 1 conv layer as prelayer,
        # we still use name a3, b3.......
        self.a3 = Inception(192, 64, 96, 128, 16, 32, 32)
        self.b3 = Inception(256, 128, 128, 192, 32, 96, 64)

        # """In general, an Inception network is a network consisting of
        # modules of the above type stacked upon each other, with occasional
        # max-pooling layers with stride 2 to halve the resolution of the
        # grid"""
        self.maxpool = nn.MaxPool2d(3, stride=2, padding=1)

        self.a4 = Inception(480, 192, 96, 208, 16, 48, 64)
        self.b4 = Inception(512, 160, 112, 224, 24, 64, 64)
        self.c4 = Inception(512, 128, 128, 256, 24, 64, 64)
        self.d4 = Inception(512, 112, 144, 288, 32, 64, 64)
        self.e4 = Inception(528, 256, 160, 320, 32, 128, 128)

        self.a5 = Inception(832, 256, 160, 320, 32, 128, 128)
        self.b5 = Inception(832, 384, 192, 384, 48, 128, 128)

        # input feature size: 8*8*1024
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout2d(p=0.4)
        self.linear = nn.Linear(1024, num_class)

    def forward(self, x):
        output = self.prelayer(x)
        output = self.a3(output)
        output = self.b3(output)

        output = self.maxpool(output)

        output = self.a4(output)
        output = self.b4(output)
        output = self.c4(output)
        output = self.d4(output)
        output = self.e4(output)

        output = self.maxpool(output)

        output = self.a5(output)
        output = self.b5(output)

        # """It was found that a move from fully connected layers to
        # average pooling improved the top-1 accuracy by about 0.6%,
        # however the use of dropout remained essential even after
        # removing the fully connected layers."""
        output = self.avgpool(output)
        output = self.dropout(output)
        output = output.view(output.size()[0], -1)
        output = self.linear(output)

        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
            if args.dry_run:
                break


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0

    top5count = 0

    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)

            v, result = output.topk(5, 1, True, True)
            top5count += torch.eq(result, target.view(-1, 1)
                                  ).sum().int().item()

            # sum up batch loss
            test_loss += F.cross_entropy(output,
                                         target, reduction='sum').item()
            # get the index of the max log-probability
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Top 1 Error: {}/{} ({:.2f}), Top 5 Error: {}/{} ({:.2f})\n'.format(
        test_loss,
        len(test_loader.dataset) - correct, len(test_loader.dataset),
        1 - correct / len(test_loader.dataset),
        len(test_loader.dataset) - top5count, len(test_loader.dataset),
        1 - top5count / len(test_loader.dataset),
    ))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch GoogLeNet CIFAR-100 Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=10, metavar='N',
                        help='input batch size for testing (default: 10)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--dry-run', action='store_true', default=False,
                        help='quickly check a single pass')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    # Fall back to the CPU when CUDA is disabled or unavailable.
    device = torch.device("cuda:0" if use_cuda else "cpu")

    # Always shuffle the training set; the test set does not need shuffling.
    train_kwargs = {'batch_size': args.batch_size, 'shuffle': True}
    test_kwargs = {'batch_size': args.test_batch_size}
    if use_cuda:
        cuda_kwargs = {'num_workers': 2,
                       'pin_memory': True}
        train_kwargs.update(cuda_kwargs)
        test_kwargs.update(cuda_kwargs)

    transform = transforms.Compose(
        [transforms.RandomHorizontalFlip(p=0.5),
         transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    trainset = torchvision.datasets.CIFAR100(root='./data', train=True,
                                             download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, **train_kwargs)

    testset = torchvision.datasets.CIFAR100(root='./data', train=False,
                                            download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, **test_kwargs)

    model = GoogleNet().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, trainloader, optimizer, epoch)
        test(model, device, testloader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "googlenet_cifar100.pt")

    model_parameters = filter(lambda p: p.requires_grad, model.parameters())
    params = sum([np.prod(p.size()) for p in model_parameters])
    print("Parameters:", params)


if __name__ == '__main__':
    main()
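The training script above decays the learning rate with `StepLR(step_size=1, gamma=0.7)` and calls `scheduler.step()` once per epoch. As a rough illustration of what that means over the default 14 epochs, here is a pure-Python sketch of the schedule; `step_lr_schedule` is a hypothetical helper, not part of PyTorch.

```python
def step_lr_schedule(initial_lr, gamma, epochs, step_size=1):
    """Learning rate in effect during each epoch when the scheduler is
    stepped once at the end of every epoch."""
    return [initial_lr * gamma ** (epoch // step_size) for epoch in range(epochs)]


# Defaults from the argument parser: --lr 1.0, --gamma 0.7, --epochs 14.
schedule = step_lr_schedule(initial_lr=1.0, gamma=0.7, epochs=14)
for epoch, lr in enumerate(schedule, start=1):
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

With these defaults the rate in the final epoch is 0.7^13, roughly 1% of the initial value.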

A Small GoogLeNet Experiment

Next, I ran a few experiments to see how GoogLeNet performs. I tested:

  • The bottleneck version of GoogLeNet
  • A naïve version without bottlenecks (“Naïve GoogLeNet”)
  • A “GoogLeNet Long” variant by arbitrarily adding two Inception modules
  • A “GoogLeNet Short” variant by arbitrarily removing some Inception layers

The most aggressively reduced one is “GoogLeNet Short4”, which only has two Inception modules left. You can roughly infer the model size by looking at the parameter count.
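To make the parameter counts less of a black box, the sketch below tallies the trainable parameters of the bottleneck GoogLeNet from this post by hand, without PyTorch. It assumes every `Conv2d` has a bias and every `BatchNorm2d` contributes one scale and one shift per channel, which matches the code above. The helper names are illustrative.

```python
# (input_channels, n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj)
INCEPTION_CONFIGS = [
    (192, 64, 96, 128, 16, 32, 32),      # a3
    (256, 128, 128, 192, 32, 96, 64),    # b3
    (480, 192, 96, 208, 16, 48, 64),     # a4
    (512, 160, 112, 224, 24, 64, 64),    # b4
    (512, 128, 128, 256, 24, 64, 64),    # c4
    (512, 112, 144, 288, 32, 64, 64),    # d4
    (528, 256, 160, 320, 32, 128, 128),  # e4
    (832, 256, 160, 320, 32, 128, 128),  # a5
    (832, 384, 192, 384, 48, 128, 128),  # b5
]


def conv_bn(kernel, in_ch, out_ch):
    """Parameters of a Conv2d (weights + bias) followed by a BatchNorm2d
    (one scale and one shift per channel)."""
    return kernel * kernel * in_ch * out_ch + out_ch + 2 * out_ch


def inception_params(cfg):
    c, n1x1, n3x3_reduce, n3x3, n5x5_reduce, n5x5, pool_proj = cfg
    b1 = conv_bn(1, c, n1x1)
    b2 = conv_bn(1, c, n3x3_reduce) + conv_bn(3, n3x3_reduce, n3x3)
    # The "5x5" branch is implemented as two stacked 3x3 convolutions.
    b3 = (conv_bn(1, c, n5x5_reduce) + conv_bn(3, n5x5_reduce, n5x5)
          + conv_bn(3, n5x5, n5x5))
    b4 = conv_bn(1, c, pool_proj)  # max-pooling itself has no parameters
    return b1 + b2 + b3 + b4


def googlenet_params(num_class):
    prelayer = conv_bn(3, 3, 192)
    modules = sum(inception_params(cfg) for cfg in INCEPTION_CONFIGS)
    linear = 1024 * num_class + num_class
    return prelayer + modules + linear


print("CIFAR-100:", googlenet_params(100))
print("CIFAR-10: ", googlenet_params(10))
```

The two totals, 6,258,500 and 6,166,250, agree with the "GoogleNet" rows in the result tables. The only difference between the datasets is the size of the final linear layer.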

I ran these models on both CIFAR-100 and CIFAR-10, and recorded Top-1 Error, Top-5 Error, Parameters, and Time.

GoogLeNet on CIFAR-100:

| Model | Top-1 Error | Top-5 Error | Parameters | Time (14 epochs) |
| --- | --- | --- | --- | --- |
| GoogleNet Naïve | 0.36 | 0.09 | 65736148 | 52m38s |
| GoogleNet | 0.34 | 0.10 | 6258500 | 29m8s |
| GoogleNet Long | 0.35 | 0.10 | 9641924 | 36m41s |
| GoogleNet Short | 0.32 | 0.09 | 5271652 | 23m11s |
| GoogleNet Short2 | 0.32 | 0.09 | 3523556 | 16m29s |
| GoogleNet Short3 | 0.36 | 0.11 | 1985220 | 9m3s |
| GoogleNet Short4 | 0.44 | 0.15 | 1650084 | 8m56s |

GoogLeNet on CIFAR-10:

| Model | Top-1 Error | Top-5 Error | Parameters | Time (14 epochs) |
| --- | --- | --- | --- | --- |
| GoogleNet Naïve | 0.15 | 0.01 | 65291098 | 52m51s |
| GoogleNet | 0.10 | 0.00 | 6166250 | 28m45s |
| GoogleNet Long | 0.11 | 0.00 | 9549674 | 40m12s |
| GoogleNet Short | 0.10 | 0.00 | 5179402 | 27m30s |
| GoogleNet Short2 | 0.10 | 0.00 | 3431306 | 31m57s |
| GoogleNet Short3 | 0.11 | 0.00 | 1892970 | 26m31s |
| GoogleNet Short4 | 0.15 | 0.01 | 1557834 | 25m30s |

First, the parameter count of the naïve version is indeed about 10× larger (65.7M versus 6.3M on CIFAR-100), yet its accuracy is no better. Also, apart from Short4 (and the naïve version on CIFAR-10, which matches Short4 at 0.15), the variants perform similarly: on CIFAR-100, Top-1 Error is roughly 0.35 and Top-5 Error roughly 0.10; on CIFAR-10, Top-1 Error is around 0.10 and Top-5 Error rounds to 0.00.
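As a quick check on the "about 10×" figure, the snippet below expresses each CIFAR-100 model's size relative to the bottleneck GoogLeNet, next to its Top-1 error. The numbers are copied from the CIFAR-100 table; `relative_sizes` is just an illustrative helper.

```python
# (model, top-1 error, parameters) copied from the CIFAR-100 table.
CIFAR100_RESULTS = [
    ("GoogleNet Naïve", 0.36, 65736148),
    ("GoogleNet", 0.34, 6258500),
    ("GoogleNet Long", 0.35, 9641924),
    ("GoogleNet Short", 0.32, 5271652),
    ("GoogleNet Short2", 0.32, 3523556),
    ("GoogleNet Short3", 0.36, 1985220),
    ("GoogleNet Short4", 0.44, 1650084),
]


def relative_sizes(rows, baseline="GoogleNet"):
    """Parameter count of each model divided by the baseline's count."""
    base = {name: params for name, _, params in rows}[baseline]
    return {name: params / base for name, _, params in rows}


ratios = relative_sizes(CIFAR100_RESULTS)
for name, top1, _ in CIFAR100_RESULTS:
    print(f"{name:<18} {ratios[name]:5.2f}x params   top-1 error {top1:.2f}")
```

The naïve model is about 10.5 times the size of the bottleneck model, while Short3 is under a third of its size with a Top-1 error only 0.02 higher.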

My guess is that CIFAR-100 and CIFAR-10 are not complex enough for model depth to matter much. Besides depth, the number of channels also affects accuracy: if the layers are wide enough, perhaps such a deep network is not needed. I also tried randomly adding max-pooling layers and dropout layers to GoogLeNet, but the errors were largely the same. This suggests that, at least on these datasets, classification accuracy is fairly tolerant of changes to the architecture.

Of course, the conclusions above only apply to simple datasets. On more challenging benchmarks, even a few percentage points of accuracy matter a lot—tiny differences after the decimal point often represent countless engineering ideas and hard work. Still, the most impressive part of GoogLeNet is that it removes a huge number of parameters while achieving nearly the same accuracy!