# Convolutional neural networks

[Convolutional neural networks (CNNs)](https://en.wikipedia.org/wiki/Convolutional_neural_network) are a type of neural network that are particularly well-suited for images, in computer vision tasks including image classification, object detection, and image segmentation. The main idea behind CNNs is to use a _convolutional layer_ to extract features from the image locally. The convolutional layer is typically followed by a _pooling layer_ to reduce the dimensionality. The convolutional and pooling layers are then followed by one or more _fully connected layers_, e.g. to classify the image.

On 30 September 2012, a CNN called [AlexNet](https://en.wikipedia.org/wiki/AlexNet) (click to view the architecture) achieved a top-5 error of 15.3% in the [ImageNet Challenge](https://en.wikipedia.org/wiki/ImageNet#ImageNet_Challenge), more than **10.8 percentage points** lower than that of the runner-up. This is considered a breakthrough and has attracted the attention of an increasing number of researchers, practitioners, and the general public. Since then, deep learning has spread to many research and application areas. AlexNet contained **eight layers**. In 2015, it was outperformed by a very deep CNN with **over 100 layers** from Microsoft in the ImageNet 2015 contest.

Watch the 14-minute video below for a visual explanation of convolutional neural networks.

```{admonition} Video
<iframe width="700" height="394" src="https://www.youtube.com/embed/HGwBXDKFk9I?start=47&end=862" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[Explaining main ideas behind convolutional neural networks by StatQuest](https://www.youtube.com/embed/HGwBXDKFk9I?start=47&end=862), embedded according to [YouTube's Terms of Service](https://www.youtube.com/static?gl=CA&template=terms).
```

Remove or comment out the following installation if you have installed PyTorch and TorchVision already.

In [None]:
!pip3 install -q torch torchvision

## Why convolutional neural networks?

In the [previous section](https://pykale.github.io/transparentML/10-deep-cnn-rnn/multilayer-nn.html), we used fully connected neural networks to classify digit images, where the input image needs to be _flattened_ into a vector. There are two major drawbacks of using fully connected neural networks for image classification:

- The number of parameters in the fully connected layer can be very large. For example, if the input image is $28\times 28$ pixels (MNIST), then the number of weights for each hidden unit in the fully connected layer is $28\times 28 = 784$. If the number of hidden units in the fully connected layer is 100, then the number of weight parameters in the fully connected layer is $784\times 100 = 78,400$, for a total of $78,400 + 100 = 78,500$ parameters (there are 100 bias parameters). If we have an input image of a larger size $224\times 224$ pixels, then the total number of parameters in the fully connected layer with 100 hidden units is $224\times 224 \times 100 + 100 = 5,017,700$. This is a lot of parameters to learn and to compute the output once the network is trained. 
- Fully connected neural networks do not make use of the spatial structure of the image. Moreover, a small shift in the position of the image can result in a very different input vector and thus the output of the network can be quite different. This is not desirable for image classification. For image classification, we hope to utilise and preserve the spatial information of the image. This is where convolutional neural networks come in. 

There are two key ideas behind convolutional neural networks:

- **Local connectivity**: The convolutional layer is only connected to a small region of the input. This allows the convolutional layer to learn local features using only a small number of parameters.
- **Weight sharing**: The weights in the convolutional layer are shared across the entire input to detect the same local feature at different locations, across the entire input. This greatly reduces the number of parameters to learn.
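To make the saving concrete, the following sketch compares the parameter count of the fully connected layer from the MNIST example above with that of a convolutional layer. The convolutional configuration (100 filters of size $5\times 5$ on a single-channel input) and the helper name `count_parameters` are assumptions chosen only for illustration.

```python
import torch.nn as nn


def count_parameters(layer):
    """Total number of learnable parameters (weights and biases) in a layer."""
    return sum(p.numel() for p in layer.parameters())


# Fully connected: every hidden unit is connected to all 28 * 28 = 784 pixels.
fc = nn.Linear(28 * 28, 100)

# Convolutional (illustrative assumption): 100 filters of size 5 x 5, each
# shared across all image locations.
conv = nn.Conv2d(in_channels=1, out_channels=100, kernel_size=5)

fc_params = count_parameters(fc)
conv_params = count_parameters(conv)
print("Fully connected:", fc_params)  # 784 * 100 + 100 = 78500
print("Convolutional:", conv_params)  # 5 * 5 * 1 * 100 + 100 = 2600
```

The convolutional count does not depend on the image size, whereas the fully connected count grows with the number of pixels.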

Let us see how convolutional neural networks work on an example of image classification adapted from the PyTorch tutorial [Training a classifier](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py) and [the CNN notebook from Lisa Zhang](https://www.cs.toronto.edu/~lczhang/360/lec/w04/convnet.html).

## Load the CIFAR10 image data

Get ready by importing the APIs needed from respective libraries and setting the random seed for reproducibility.

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torchvision import datasets, transforms
import matplotlib.pyplot as plt

%matplotlib inline

torch.manual_seed(2022)
np.random.seed(2022)

It will be good to be aware of the version of PyTorch and TorchVision you are using. The following code will print the version of PyTorch and TorchVision. This notebook is developed using PyTorch 1.13.1 and TorchVision 0.14.1.

In [None]:
torch.__version__

In [None]:
torchvision.__version__

The [CIFAR10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) has ten classes: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, ‘truck’. The images in CIFAR-10 are of size $3\times 32\times 32$, i.e. 3-channel colour images of $32\times 32$ pixels in size.

As in the case of MNIST, the `torchvision` package has a data loader for CIFAR10 as well. The data loader downloads the data from the internet the first time it is run and stores it in the given root directory. 

Similar to the MNIST example, we apply the `ToTensor` transform to convert the PIL images to tensors. In addition, we also apply the `Normalize` transform to normalise the images with some preferred mean and standard deviation, such as (0.5, 0.5, 0.5) and (0.5, 0.5, 0.5) used below or the mean and standard deviation of the ImageNet dataset (0.485, 0.456, 0.406) and (0.229, 0.224, 0.225) respectively. 
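For each channel, `Normalize` computes `(value - mean) / std`. The sketch below reproduces that arithmetic with plain tensors on a randomly generated stand-in image (not CIFAR10 data) to show that a mean and standard deviation of 0.5 map values from $[0, 1]$ to $[-1, 1]$, and that the mapping can be undone.

```python
import torch

# A stand-in 3-channel image with pixel values in [0, 1], as produced by ToTensor.
img = torch.rand(3, 32, 32)

# One mean and one standard deviation per channel, shaped for broadcasting.
mean = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)
std = torch.tensor([0.5, 0.5, 0.5]).view(3, 1, 1)

# Per-channel normalisation: (value - mean) / std.
normalised = (img - mean) / std

# Undo the normalisation: value * std + mean (here, value / 2 + 0.5).
restored = normalised * std + mean

print(normalised.min().item(), normalised.max().item())  # both within [-1, 1]
print(torch.allclose(restored, img, atol=1e-6))
```

The inverse step, `value / 2 + 0.5`, is what is needed before displaying normalised images.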

Let us load the train and test sets using a batch size of 8, i.e. each batch yielded by the dataloader `train_loader` contains 8 images and their corresponding labels. The `num_workers` argument specifies the number of subprocesses to use for data loading. We use 2 subprocesses here for faster data loading.

In [None]:
batch_size = 8
root_dir = "./data"
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
)

train_dataset = datasets.CIFAR10(
    root=root_dir, train=True, download=True, transform=transform
)
test_dataset = datasets.CIFAR10(
    root=root_dir, train=False, download=True, transform=transform
)

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, shuffle=True, num_workers=2
)
test_loader = torch.utils.data.DataLoader(
    test_dataset, batch_size=batch_size, shuffle=False, num_workers=2
)

### Data inspection

Let us examine the dataset a bit.

In [None]:
print("Training set size:", len(train_dataset))
print("Training set shape:", train_dataset.data.shape)
print("Test set size:", len(test_dataset))
print("Classes:", train_dataset.classes)

We can also examine the `train_dataset` object directly.

In [None]:
train_dataset

Also, we can examine the `test_dataset` object similarly.

In [None]:
test_dataset

### Visualise the data

Let us show some of the training images to see what they look like. Here, we define a function `imshow` to show images, which can be reused later.

In [None]:
def imshow(imgs):
    imgs = imgs / 2 + 0.5  # unnormalise back to [0,1]
    plt.imshow(np.transpose(torchvision.utils.make_grid(imgs).numpy(), (1, 2, 0)))
    plt.show()


dataiter = iter(train_loader)
images, labels = next(dataiter)  # get a batch of images
imshow(images)  # show images
print(
    " ".join("%5s" % train_dataset.classes[labels[j]] for j in range(batch_size))
)  # print labels

## Define a convolutional neural network

{numref}`typical-cnn` shows a typical convolutional neural network (CNN) architecture. There are several filter kernels per convolutional layer, resulting in layers of feature maps that each receive the same input but extract different features due to _different weight matrices_ (to be learnt). Subsampling corresponds to pooling operations that reduce the dimensionality of the feature maps. The last layer is a fully connected layer (also called a _dense_ layer) that performs the classification. 

```{figure} https://upload.wikimedia.org/wikipedia/commons/6/63/Typical_cnn.png
---
height: 200px
name: typical-cnn
---
A typical convolutional neural network (CNN) architecture.
```

Let us look at operations in CNNs in detail.

### Convolution layer with a shared kernel/filter
<center>
<img src="https://www.cs.toronto.edu/~lczhang/360/lec/w04/imgs/math_kernel.png" width="100px" style="margin:0; display:inline">
<img src="https://www.cs.toronto.edu/~lczhang/360/lec/w04/imgs/math_conv.png" width="300px" style="margin:0; display:inline">
</center>

The light blue grid (middle) is the *input* that we are given, e.g. a 5 pixel by 5 pixel greyscale image. The grey grid (left) is a **convolutional kernel/filter** of size $3 \times 3$, containing the *parameters* of this neural network layer.

To compute the output, we superimpose the kernel on a region of the image. Let's start at the top left, in the dark blue region. The small numbers in the bottom right corner of each grid element correspond to the numbers in the kernel.
To compute the output at the corresponding location (top left), we "dot" the pixel intensities in the square region with the kernel. That is, we perform the computation:

In [None]:
(3 * 0 + 3 * 1 + 2 * 2) + (0 * 2 + 0 * 2 + 1 * 0) + (3 * 0 + 1 * 1 + 2 * 2)

The green grid (right) contains the *output* of this convolution layer. This output is also called an **output feature map**. The terms **feature** and **activation** are interchangeable in neural networks. The output value on the top left of the green grid is consistent with the value we obtained by hand in Python.

To compute the next activation value (say, one to the right of the previous output), we will shift the superimposed kernel over by one pixel:

<img src="https://www.cs.toronto.edu/~lczhang/360/lec/w04/imgs/math_conv2.png" width="300px">

The dark blue region is moved to the right by one pixel. We again dot the pixel intensities in this region with the kernel to get another 12, and continue in the same way to get 17, $\ldots$, 14. The green grid is updated accordingly.

**Shrunken output**: Here, we did not use **zero padding** (at the edges), so the output of this layer is shrunk by 1 on all sides. If the kernel size is $k=2m+1$, the output will be shrunk by $m$ on all sides, so the width and height will both be reduced by $2m$.
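The whole sliding computation can be reproduced with `torch.nn.functional.conv2d`, which applies the same "dot" operation at every position. In this sketch, the kernel and the top-left $3\times 3$ region are taken from the hand computation above; the remaining input entries are read off the figure and should be treated as an assumption.

```python
import torch
import torch.nn.functional as F

# 5 x 5 greyscale input (entries outside the top-left region are assumed
# from the figure).
image = torch.tensor(
    [
        [3, 3, 2, 1, 0],
        [0, 0, 1, 3, 1],
        [3, 1, 2, 2, 3],
        [2, 0, 0, 2, 2],
        [2, 0, 0, 0, 1],
    ],
    dtype=torch.float32,
)

# 3 x 3 kernel, matching the multipliers in the hand computation.
kernel = torch.tensor(
    [[0, 1, 2], [2, 2, 0], [0, 1, 2]],
    dtype=torch.float32,
)

# conv2d expects an NCHW input and weights of shape
# (out_channels, in_channels, kernel_height, kernel_width).
feature_map = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3)).squeeze()

print(feature_map.shape)  # no padding, so 5 x 5 shrinks to 3 x 3
print(feature_map)
```

The top-left entry of `feature_map` equals the value computed by hand, and the $3\times 3$ output shape illustrates the shrinkage by $m=1$ on each side.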

### Convolutions with multiple input/output channels

For a colour image, the kernel will be a **3-dimensional tensor**. This kernel will move through the input features just like before, and we "dot" the pixel intensities with the kernel at each region, exactly like before. The size of the third (colour) dimension is called the **number of input channels** or **number of input feature maps**.

We also want to detect multiple features, e.g. both horizontal edges and vertical edges. We would want to learn **many** convolutional filters on the same input. That is, we would want to make the same computation above using different kernels, like this:

<img src="https://upload.wikimedia.org/wikipedia/commons/6/68/Conv_layer.png" width="200px">

Each circle on the right of the image represents the output of a different kernel dotted with the highlighted region of the input. So, the output feature is also a 3-dimensional tensor. The size of the new dimension is called the **number of output channels** or **number of output feature maps**. In the picture above, there are 5 output channels.

The `Conv2d` layer expects as input a tensor in the format "NCHW", meaning that the dimensions of the tensor should follow the order:

* batch size
* channel
* height
* width

Let us create a convolutional layer using `nn.Conv2d`:

In [None]:
myconv1 = nn.Conv2d(
    in_channels=3,  # number of input channels
    out_channels=7,  # number of output channels
    kernel_size=5,  # size of the kernel
)

Emulate a batch of 32 colour images, each of size 128x128, like the following:


In [None]:
x = torch.randn(32, 3, 128, 128)
y = myconv1(x)
y.shape

The output tensor is also in the "NCHW" format. We still have 32 images, now with 7 channels (consistent with the value of `out_channels` of `Conv2d`), each of size 124x124. If we add the appropriate padding to `Conv2d`, namely `padding` = $m$ for a kernel size of $2m+1$ (here $m=2$), then the output width and height match the input width and height.

In [None]:
myconv2 = nn.Conv2d(in_channels=3, out_channels=7, kernel_size=5, padding=2)

x = torch.randn(32, 3, 128, 128)
y = myconv2(x)
y.shape
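The output sizes seen above follow from the usual size formula for a convolution along one spatial dimension, $\lfloor (n + 2p - k)/s \rfloor + 1$, where $n$ is the input size, $k$ the kernel size, $p$ the padding and $s$ the stride. A minimal sketch, with `conv_output_size` being a hypothetical helper name and dilation assumed to be 1:

```python
def conv_output_size(n, kernel_size, padding=0, stride=1):
    """Output width/height of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel_size) // stride + 1


size_no_padding = conv_output_size(128, kernel_size=5)
size_same_padding = conv_output_size(128, kernel_size=5, padding=2)
size_cifar = conv_output_size(32, kernel_size=5)

print(size_no_padding)  # 128 - 5 + 1 = 124
print(size_same_padding)  # padding m = 2 keeps the size at 128
print(size_cifar)  # a 32 x 32 input shrinks to 28 x 28
```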

Examine the parameters of `myconv2`:

In [None]:
conv_params = list(myconv2.parameters())
print("len(conv_params):", len(conv_params))
print("Filters:", conv_params[0].shape)  # 7 filters, each of size 3 x 5 x 5
print("Biases:", conv_params[1].shape)

### Pooling layers for subsampling

A pooling layer can be created like this: 
<img src="https://upload.wikimedia.org/wikipedia/commons/e/e9/Max_pooling.png" width="300px">

In [None]:
mypool = nn.MaxPool2d(kernel_size=2, stride=2)
y = myconv2(x)
z = mypool(y)
z.shape

Usually, the kernel size and the stride length will be equal so each pixel is pooled only once. 
The pooling layer has **no trainable parameters**:

In [None]:
list(mypool.parameters())
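To see what max pooling computes, here is a small sketch on a hand-made $4\times 4$ input. The values are made up for illustration; each output entry is the maximum of one non-overlapping $2\times 2$ block.

```python
import torch
import torch.nn as nn

# Illustrative 4 x 4 single-channel input in NCHW format (values are made up).
x_small = torch.tensor(
    [
        [1.0, 1.0, 2.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [3.0, 2.0, 1.0, 0.0],
        [1.0, 2.0, 3.0, 4.0],
    ]
).view(1, 1, 4, 4)

pool_demo = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool_demo(x_small).squeeze()

print(pooled)  # top-left block {1, 1, 5, 6} -> 6, top-right {2, 4, 7, 8} -> 8, ...
```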

### Define a CNN class

Now we define a CNN class consisting of several layers as defined below (from the official PyTorch tutorial).

In [None]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(
            3, 6, 5
        )  # 3=#input channels; 6=#output channels; 5=kernel size
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


myCNN = CNN()

Here, `__init__()` defines the layers.  `forward()` defines the *forward pass* that transforms the input to the output. `backward()` is automatically defined using `autograd`. `relu()` is the [rectified linear unit](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) **activation function** that performs a *nonlinear* transformation/mapping of an input variable (element-wise operation). `Conv2d()` defines a convolution layer, as shown below where blue maps indicate inputs, and cyan maps indicate outputs.

<table>
    <tr>
    <td  style="text-align: left"> Convolution with no padding, no strides.      <img src="https://raw.githubusercontent.com/vdumoulin/conv_arithmetic/master/gif/no_padding_no_strides.gif" alt="Drawing" style="width: 250px;"/> </td>
</tr>
</table>

More convolution layers are illustrated nicely at [Convolution arithmetic](https://github.com/vdumoulin/conv_arithmetic) (click to explore). 

As defined above, this network `CNN()` has **two** convolutional layers: `conv1` and `conv2`.

- The first convolutional layer `conv1` requires an input with 3 channels, outputs **6 channels**, and has a kernel size of $5\times 5$. We are not adding any zero-padding.
- The second convolutional layer `conv2` requires an input with **6 channels** (note this **MUST match the output channel number of the previous layer**),  outputs 16 channels, and has a kernel size of (again) $5\times 5$. We are not adding any zero-padding.

In the `forward` function, each convolution operation is followed by the usual ReLU activation function and then a pooling operation. The pooling operation used is max pooling with a $2\times 2$ window and a stride of 2, so each pooling operation
**reduces the width and height of the neurons in the layer by half**.

Because we are not adding any zero padding, we end up with $16\times 5\times 5$ hidden units
after the second convolution and pooling step (`16` matches the output channel number of `conv2`, $5\times 5$ is based on the input dimension $32\times 32$, see below). These units are then passed to three fully connected layers, with the usual ReLU activation after each of the first two.

Notice that the number of channels **grew** in later convolutional layers! However, the number of hidden units in each layer is still reduced because of the convolution and pooling operation:

* Initial Image Size: $3 \times 32 \times 32 $
* After `conv1`: $6 \times 28 \times 28$ ($32 \times 32$ is reduced by `2` on each side)
* After Pooling: $6 \times 14 \times 14 $ (image size halved)
* After `conv2`: $16 \times 10 \times 10$ ($14 \times 14$ is reduced by `2` on each side)
* After Pooling: $16 \times 5 \times 5 $ (halved)
* After `fc1`: $120$
* After `fc2`: $84$
* After `fc3`: $10$ (**= number of classes**)

This pattern of **doubling the number of channels with every pooling / strided convolution** is common in modern convolutional architectures. It is used to avoid loss of too much information within a single reduction in resolution.
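The sizes listed above can be checked by pushing a dummy input through the convolution and pooling layers and recording the shape after each step. This sketch rebuilds layers with the same configuration as the `CNN` class so that it runs on its own; the weights are random and untrained, which does not affect the shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer configuration as in the CNN class, rebuilt for a standalone check.
conv1 = nn.Conv2d(3, 6, 5)
conv2 = nn.Conv2d(6, 16, 5)
pool = nn.MaxPool2d(2, 2)

x = torch.zeros(1, 3, 32, 32)  # one dummy CIFAR10-sized image in NCHW format
shapes = [tuple(x.shape)]

x = F.relu(conv1(x))
shapes.append(tuple(x.shape))
x = pool(x)
shapes.append(tuple(x.shape))
x = F.relu(conv2(x))
shapes.append(tuple(x.shape))
x = pool(x)
shapes.append(tuple(x.shape))

for name, shape in zip(["input", "conv1", "pool", "conv2", "pool"], shapes):
    print(f"{name:>6}: {shape}")
```

The final shape has $16 \times 5 \times 5 = 400$ values per image, which is why `fc1` takes `16 * 5 * 5` inputs.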

### Inspect the CNN architecture

Now let's take a look at the CNN built. 

In [None]:
print(myCNN)

Let us check the (randomly initialised) parameters of this NN. Below, we check the first 2D convolution. 

In [None]:
params = list(myCNN.parameters())
print(len(params))
print(params[0].size())  # First Conv2d's .weight
print(params[1].size())  # First Conv2d's .bias
print(params[1])

In the above, we only printed the bias values. The weight values are printed below.

In [None]:
print(params[0])

To learn more about these functions, refer to the [`torch.nn` documentation](https://pytorch.org/docs/stable/nn.html) (search for the function, e.g., search for `torch.nn.ReLU` and you will find [its documentation](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html#torch.nn.ReLU)).

## Optimisation, training and testing

### Choose a criterion and an optimiser

Here, we choose the cross-entropy loss as the criterion and the stochastic gradient descent (SGD) with momentum as the optimiser.

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(myCNN.parameters(), lr=0.001, momentum=0.9)
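`nn.CrossEntropyLoss` expects raw, unnormalised class scores and integer class indices as targets; it applies a log-softmax internally before taking the negative log-likelihood. The sketch below illustrates this equivalence on made-up scores and targets (not outputs of the network above).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)  # made-up raw scores: 4 samples, 10 classes
targets = torch.tensor([3, 0, 9, 1])  # made-up class indices

# Built-in criterion applied directly to the raw scores.
loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps.
log_probs = F.log_softmax(logits, dim=1)
loss_manual = F.nll_loss(log_probs, targets)

print(loss_builtin.item(), loss_manual.item())  # the two values agree
```

This is why the last layer of the network returns raw scores without a softmax.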

### Train the network

Next, we will feed data to our CNN to train it, i.e. learn its parameters so that the criterion above (cross-entropy loss) is minimised, using the SGD optimiser. The dataset is loaded in batches to train the model. One `epoch` means one cycle through the full training dataset. The steps are:
* Define the optimisation criterion and optimisation method.
* Iterate through the whole dataset in batches for a number of `epochs`, until a specified maximum is reached or a convergence criterion is met (e.g., successive change of loss < 0.000001).
* For each batch, we 
    * do a forward pass
    * compute the loss
    * backpropagate the loss via `autograd`
    * update the parameters

Now, we loop over our data iterator, and feed the inputs to the network and optimize. Here, we set `max_epochs` to 3 for quick testing. In practice, more epochs typically lead to better performance. 

In [None]:
max_epochs = 3
for epoch in range(max_epochs):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = myCNN(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print("[%d, %5d] loss: %.3f" % (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print("Finished Training!")

Take a look at how `autograd` keeps track of the gradients for back propagation.

In [None]:
print(loss.grad_fn)
print(loss.grad_fn.next_functions[0][0])
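The same mechanism can be seen on a toy example that is small enough to check by hand. Here `w` and `loss_demo` are illustrative names unrelated to the network above.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)
loss_demo = (2 * w - 1) ** 2  # a toy "loss" that depends on w

print(loss_demo.grad_fn)  # the operation that produced loss_demo
loss_demo.backward()  # d/dw (2w - 1)^2 = 4 * (2w - 1)
print(w.grad)  # 4 * (2 * 3 - 1) = 20
```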

### Save our trained model:

In [None]:
PATH = root_dir + "/cifar_net.pth"
torch.save(myCNN.state_dict(), PATH)

See [more details on saving PyTorch models](https://pytorch.org/docs/stable/notes/serialization.html).

### Test the network on the test data

We will test the trained network by predicting the class label that the neural network outputs, and checking it against the ground-truth. 

Firstly, let us show some images from the test set and their ground-truth labels.

In [None]:
dataiter = iter(test_loader)
images, labels = next(dataiter)

# print images
imshow(images)  # imshow already arranges the batch into a grid
print(
    "GroundTruth: ",
    " ".join("%5s" % train_dataset.classes[labels[j]] for j in range(batch_size)),
)

Next, load back in our saved model (note: saving and re-loading wasn't necessary here, we only did it for illustration):

In [None]:
loadCNN = CNN()
loadCNN.load_state_dict(torch.load(PATH))

Now, let us see what the neural network thinks these examples above are:

In [None]:
outputs = loadCNN(images)

The outputs are energies for the 10 classes. The higher the energy for a class, the more the network thinks that the image is of the particular class. Thus, let us find the index of the highest energy to get the predicted class.

In [None]:
_, predicted = torch.max(outputs, 1)

print(
    "Predicted: ",
    " ".join("%5s" % train_dataset.classes[predicted[j]] for j in range(batch_size)),
)
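If probabilities are preferred over raw scores, a softmax over the class dimension can be applied. The sketch below uses random scores as a stand-in for network outputs and checks that the predicted class is the same either way, since softmax preserves the ordering of the scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(8, 10)  # stand-in for network outputs: 8 images, 10 classes

probs = F.softmax(scores, dim=1)  # each row is non-negative and sums to 1
confidence, predicted_demo = torch.max(probs, 1)

# Softmax is monotonic, so the most likely class is the highest-scoring class.
same_prediction = torch.equal(predicted_demo, torch.argmax(scores, dim=1))

print(probs.sum(dim=1))
print(same_prediction)
```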

We should get at least half correct.

Let us look at how the network performs on the whole dataset.

In [None]:
correct = 0
total = 0
with torch.no_grad():  # testing phase, no need to compute the gradients to save time
    for data in test_loader:
        images, labels = data
        outputs = loadCNN(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(
    "Accuracy of the network on the 10000 test images: %d %%" % (100 * correct / total)
)

We should get something above 50%, which is much better than random guessing.

Let us examine what are the classes that performed well, and the classes that did not perform well:

In [None]:
class_correct = list(0.0 for i in range(10))
class_total = list(0.0 for i in range(10))
with torch.no_grad():
    for data in test_loader:
        images, labels = data
        outputs = loadCNN(images)
        _, predicted = torch.max(outputs, 1)
        c = predicted == labels
        for i in range(labels.size(0)):  # also handles a smaller final batch
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1

for i in range(10):
    print(
        "Accuracy of %5s : %2d %%"
        % (train_dataset.classes[i], 100 * class_correct[i] / class_total[i])
    )

We can see that the network performs well on some classes but poorly on others, noting that we only trained it for 3 epochs.

## Exercises

**1.** Suppose we have a **fully connected neural network (multilayer perceptron)** with $3$ inputs and $2$ outputs. In between, we have **three hidden layers**, i.e., Hidden Layer $1$ **($4$ neurons)** after the input layer, Hidden Layer $2$ **($6$ neurons)** after Hidden Layer $1$, and Hidden Layer $3$ **($5$ neurons)** after Hidden Layer $2$, with full connections between all adjacent layers and no other connections. The activation function **sigma (sigmoid)** is used in the hidden layers. How many **learnable parameters** in total are there for this **three-hidden-layer** neural network?

*Compare your answer with the solution below*

```{toggle}

Firstly we must count all of the weights which connect the layers of our model,

Number of weights = $(3 × 4) + (4 × 6) + (6 × 5) + (5 × 2) = 76$

Next, we count all of the bias parameters,

Number of biases = $4 + 6 + 5 + 2 = 17$.

The sum of these two values is the total number of model parameters, therefore the
answer is $76 + 17 = 93$.

```

**2.** We have a $512 × 512 × 3$ colour image. We apply $100$ $5 × 5$ **filters** with **stride** $7$ and **pad** $2$ to obtain a **convolution output**. What is the **output volume size**? How many **parameters** are needed for such a layer?

*Compare your answer with the solution below*

```{toggle}

**Size of output:**

Size of output: $(\text{Image Length} - \text{Filter Size} + 2 \times \text{Padding}) / \text{Stride} + 1$

Image Length = $512$

Filter Size = $5$

Stride = $7$

Padding = $2$

After applying the first $5 × 5$ filter:

Output Size After First Filter = $(512 − 5 + 2 × 2)/7 + 1 = 74$

Final Output Shape = Number of Filters × Output Size × Output Size

Final Output Shape = $100 × 74 × 74$


**Number of parameters:**

Number of parameters = $(\text{Filter Width} \times \text{Filter Height} \times \text{Filters in Previous Layer} + 1) \times \text{Number of Filters}$

Filter Width = $5$

Filter Height = $5$

Filters in Previous Layer = $3$

Number of Filters = $100$

Number of parameters = $(5 × 5 × 3 + 1) × 100 = 7600$


```

**3.**  **OCTMNIST** is based on an existing dataset of 109,309 valid optical coherence tomography (OCT) images for retinal diseases, with 4 different types, leading to a multi-class classification task. The source training set is split with a ratio of 9 : 1 into training and validation sets, and uses its source validation set as the test set.

**Note:** The paragraph above describes how the authors construct OCTMNIST from the source dataset, provided as background information. You do not have to use this information to complete the following exercises. OCTMNIST has fixed training, validation, and test sets with respective APIs so you just need to use the provided API and splits in OCTMNIST. 


Follow the instructions at [https://github.com/MedMNIST/MedMNIST](https://github.com/MedMNIST/MedMNIST) to download and load the data. Use a similar method to the one you used in **Exercise 1** in Section [10.1.8](https://pykale.github.io/transparentML/10-deep-cnn-rnn/convolutional-nn.html#exercises) to fetch the data. Again, use the `torchvision` package to compose a transformation to convert the data to tensors and normalise it (although this time, don't flatten the data!). In your training, validation and testing dataloaders, use a batch size of 256.

In [None]:
# Install medmnist
!python -m pip install medmnist

In [None]:
# Write your code below to answer the question

*Compare your answer with the reference solution below*

In [None]:
# Imports
import numpy as np
import time

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils import data as torch_data

# For visualising data
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import medmnist

from medmnist import INFO

SEED = 1234
torch.manual_seed(SEED)
np.random.seed(SEED)

DS_INFO = INFO["octmnist"]
data_class = getattr(medmnist.dataset, DS_INFO["python_class"])

# Download and normalise the data. ToTensor() maps pixel values from 0-255 to 0-1,
# and Normalize() with mean 0.5 and std 0.5 then maps them to the range -1 to 1.
transform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),  # one mean/std for the single channel
    ]
)


train_dataset = data_class(split="train", download=True, transform=transform)
val_dataset = data_class(split="val", download=True, transform=transform)
test_dataset = data_class(split="test", download=True, transform=transform)

# Create the data loaders with the batch size given in the exercise.
batch_size = 256
train_loader = torch_data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
# Shuffling is not needed for evaluation, so it is switched off for these two loaders.
val_loader = torch_data.DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = torch_data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

**4.** Display at least **ten images** for **each class**, i.e. at least $40$ images, from the training set loaded in **Exercise 3**. 

In [None]:
# Write your code below to answer the question

*Compare your answer with the reference solution below*

In [None]:
# Function to display the images from the dataset given a class
def display_samples(data, labels, count=10):
    """
    Display 'count' images from the dataset 'data' for each label in the list 'labels'.
    """

    fig, ax = plt.subplots(len(labels), count, figsize=(4 * count, 16))

    for label in labels:
        data_with_label = data.imgs[data.labels[:, 0] == label][:count]
        for ex_idx in range(len(data_with_label)):
            ax[label, ex_idx].imshow(data_with_label[ex_idx])

            # Turn off x,y ticks
            ax[label, ex_idx].set_yticks([])
            ax[label, ex_idx].set_xticks([])

        # Set the y axis label
        ax[label, 0].set_ylabel(
            ylabel=DS_INFO["label"][str(label)].split(" ")[0], fontsize=30
        )

    plt.show()


display_samples(train_dataset, [0, 1, 2, 3], 10)

**5.** This question asks you to design convolutional neural networks (CNNs). Only the number of convolutional (Conv) layers and the number of fully connected (FC) layers will be specified below. You are **free to design** other aspects of the network. For example, you can use other types of **operation (e.g. padding)**, **layers (e.g. pooling)**, or **preprocessing (e.g. augmentation)**, and you will need to choose the number of **units/neurons in each layer**. Likewise, you may need to customise the **number of epochs** and many other settings according to your accessible computational power.

**(a)** Design a CNN with **two Conv layers** and **two FC layers**. Train the model on the training set loaded in **Exercise 3**, and evaluate the trained model on the test set loaded in **Exercise 3** using the **accuracy** metric.


In [None]:
# Write your code below to answer the question

*Compare your answer with the reference solution below*

In [None]:
import torch.nn.functional as F

torch.manual_seed(SEED)
np.random.seed(SEED)

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")


# First CNN model with 2 convolutional layers and 2 fully connected layers
class CNN_1(nn.Module):
    def __init__(self):
        super(CNN_1, self).__init__()
        self.conv1 = nn.Conv2d(1, 4, 5)  # 4X24X24
        self.pool1 = nn.MaxPool2d(2, 2)  # 4X12X12
        self.conv2 = nn.Conv2d(4, 8, 5)  # 8X8X8
        self.pool2 = nn.MaxPool2d(2, 2)  # 8X4X4

        self.fc1 = nn.Linear(8 * 4 * 4, 80)
        self.fc2 = nn.Linear(80, 4)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))  # applying pooling to 1st conv
        x = self.pool2(F.relu(self.conv2(x)))  # applying pooling to the 2nd conv
        x = x.view(-1, 8 * 4 * 4)  # connecting conv with fc
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x  # raw scores; nn.CrossEntropyLoss applies log-softmax internally


model = CNN_1().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.CrossEntropyLoss()


def train(epoch):
    model.train()  # Set model to training mode

    # Loop over each batch from the training set
    for batch_idx, (data, target) in enumerate(train_loader):
        # Copy data to GPU if needed
        data = data.to(device)
        # Labels have shape (N, 1); the loss expects class indices of shape (N,)
        target = target.squeeze(1).long().to(device)

        optimizer.zero_grad()  # Zero gradient buffers
        output = model(data)  # Pass data through the network
        loss = criterion(output, target)  # Calculate loss
        loss.backward()  # Backpropagate
        optimizer.step()  # Update weights

    # Report the loss of the last batch in this epoch
    print("Train Epoch: {} \tLoss: {:.6f}".format(epoch, loss.item()))


def test(loss_vector, accuracy_vector):
    model.eval()  # Set model to evaluation mode
    test_loss, correct, total = 0, 0, 0
    with torch.no_grad():  # No gradients are needed for evaluation
        for data, target in test_loader:
            data = data.to(device)
            target = target.squeeze(1).long().to(device)  # (N, 1) -> (N,)
            output = model(data)
            test_loss += criterion(output, target).item()

            _, preds = torch.max(output, dim=1)
            correct += (preds == target).sum().item()
            total += target.size(0)

    test_loss /= len(test_loader)
    loss_vector.append(test_loss)

    accuracy = 100.0 * correct / total  # percentage of correct predictions
    accuracy_vector.append(accuracy)

    print(
        "\nTest set: Average loss: {:.5f}, Accuracy: ({:.2f}%)\n".format(
            test_loss, accuracy
        )
    )


epochs = 1

loss_test, acc_test = [], []
for epoch in range(1, epochs + 1):
    train(epoch)
    test(loss_test, acc_test)

**(b)** Design a CNN with **three Conv layers** and **three FC layers**. Train the model on the training set, and evaluate the trained model on the test set using the accuracy metric.

In [None]:
# Write your code below to answer the question

*Compare your answer with the reference solution below*

In [None]:
# Second CNN model with 3 convolutional layers and 3 FC layers
class CNN_2(nn.Module):
    def __init__(self):
        super(CNN_2, self).__init__()
        self.conv1 = nn.Conv2d(1, 4, 5)  # 4X24X24
        self.pool1 = nn.MaxPool2d(2, 2)  # 4X12X12
        self.conv2 = nn.Conv2d(
            4, 8, 3, padding=1
        )  # 8X12X12 (padding=1 with a 3x3 kernel keeps the size: 12 + 2*1 - 3 + 1 = 12)
        self.pool2 = nn.MaxPool2d(2, 2)  # 8X6X6
        self.conv3 = nn.Conv2d(8, 16, 3)  # 16X4X4
        self.pool3 = nn.MaxPool2d(2, 2)  # 16X2X2

        self.fc1 = nn.Linear(16 * 2 * 2, 120)
        self.fc2 = nn.Linear(120, 80)
        self.fc3 = nn.Linear(80, 4)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = self.pool3(F.relu(self.conv3(x)))
        x = x.view(-1, 16 * 2 * 2)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        # Return raw scores; nn.CrossEntropyLoss applies log-softmax internally
        return x


model = CNN_2().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)
criterion = nn.CrossEntropyLoss()

epochs = 1

loss_test, acc_test = [], []
for epoch in range(1, epochs + 1):
    train(epoch)
    test(loss_test, acc_test)