03-Neural-Nets-and-CNNs.ipynb 56.2 KB
Ben Glocker committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Tutorial 3: Neural Networks & CNNs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running on Colab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! wget https://www.doc.ic.ac.uk/~bglocker/teaching/notebooks/supervised-data.zip\n",
    "! unzip supervised-data.zip\n",
    "\n",
    "# data directory\n",
    "data_dir = 'data/mnist/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running on DoC lab machines"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# data directory\n",
    "data_dir = '/vol/lab/course/416/data/mnist/'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Aims of this tutorial**:\n",
    "- Introduce you to PyTorch, a library for working with neural networks.\n",
    "- Implement Multi-Layer Perceptrons (MLP), starting from linear regression.\n",
    "- Implement a small Convolutional Neural Network (CNN).\n",
    "\n",
    "Along the way you will implement forward and back-propagation as well as stochastic gradient descent.\n",
    "\n",
    "It may look long, but it should be easy to complete. Understanding and familiarizing yourselves with the core points of this tutorial and the PyTorch library is very important for being able to follow the next tutorials and the coursework. Invest some time to study and understand it, and don't hesitate to ask if something is unclear.\n",
    "\n",
    "**Prerequisites**:\n",
    "- Familiar with python and numpy\n",
    "- Familiar with logistic regression and MNIST\n",
    "\n",
    "\n",
    "**Notes**:\n",
    "- Docs for PyTorch's functions you will need:  \n",
    "https://pytorch.org/docs/stable/tensors.html  \n",
    "https://pytorch.org/docs/stable/nn.html  \n",
    "- Some helper functions are included below for loading and plotting data. They will be used out of the box below."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preliminary: Loading and refreshing MNIST\n",
    "\n",
    "We will be using MNIST data again in this tutorial. Because the images are small, the dataset allows small networks to be trained quickly on a CPU. Anything larger later on will require a GPU.\n",
    "\n",
    "An important point to understand is the structure of the loaded data, especially the **shape** of the loaded numpy arrays, because we need to manipulate it carefully when processing it with neural networks.\n",
    "\n",
    "Let's load and inspect the data..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import torchvision.datasets as dset\n",
    "\n",
    "\n",
    "def get_mnist(data_dir, train):\n",
    "    # data_dir: path to local directory where data is, or should be stored.\n",
    "    # train: if True, return training data. If False, return test data.\n",
    "    # If the data is not found in data_dir, it will be downloaded (download=True below).\n",
    "    data_set = dset.MNIST(root=data_dir, train=train, transform=None, download=True)\n",
    "    # Students should only deal with numpy arrays, so that it's easy to follow.\n",
    "    if train:\n",
    "        data_x = np.asarray(data_set.data, dtype='uint8') # (.train_data in older torchvision versions)\n",
    "        data_y = np.asarray(data_set.targets, dtype='int16') # int64 by default (.train_labels in older versions)\n",
    "    else:\n",
    "        data_x = np.asarray(data_set.data, dtype='uint8')\n",
    "        data_y = np.asarray(data_set.targets, dtype='int16') # int64 by default\n",
    "    return data_x, data_y\n",
    "\n",
    "\n",
    "def make_lbls_onehot(lbls, num_classes):\n",
    "    # lbls: np.array of shape [N]\n",
    "    lbls_onehot = np.zeros(shape=(lbls.shape[0], num_classes ) )\n",
    "    lbls_onehot[ np.arange(lbls_onehot.shape[0]), lbls ] = 1\n",
    "    return lbls_onehot\n",
    "\n",
    "\n",
    "# If datasets are not at specified location, they will be downloaded.\n",
    "train_imgs, train_lbls = get_mnist(data_dir=data_dir, train=True)\n",
    "test_imgs, test_lbls = get_mnist(data_dir=data_dir, train=False)\n",
    "\n",
    "print(\"[train_imgs] Type: \", type(train_imgs), \"|| Shape:\", train_imgs.shape, \"|| Data type: \", train_imgs.dtype )\n",
    "print(\"[train_lbls] Type: \", type(train_lbls), \"|| Shape:\", train_lbls.shape, \"|| Data type: \", train_lbls.dtype )\n",
    "print('Class labels in train = ', np.unique(train_lbls))\n",
    "\n",
    "print(\"[test_imgs] Type: \", type(test_imgs), \"|| Shape:\", test_imgs.shape, \" || Data type: \", test_imgs.dtype )\n",
    "print(\"[test_lbls] Type: \", type(test_lbls), \"|| Shape:\", test_lbls.shape, \" || Data type: \", test_lbls.dtype )\n",
    "print('Class labels in test = ', np.unique(test_lbls))\n",
    "\n",
    "S_tr_samples = train_imgs.shape[0] # S hereafter. Number of training samples in database.\n",
    "H_height = train_imgs.shape[1] # H hereafter\n",
    "W_width = train_imgs.shape[2] # W hereafter\n",
    "C_classes = len(np.unique(train_lbls)) # C hereafter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above we see that the data have been loaded into *numpy arrays*.    \n",
    "Arrays with images have **shape ( S = number of samples, H = height, W = width )**.  \n",
    "Arrays with labels have **shape ( S = number of samples )**, holding one integer per image: the digit's class.\n",
    "\n",
    "MNIST comprises a **train set of S_tr = 60000 images** and a **test set of S_te = 10000 images**.  \n",
    "Model development uses the train set. Finally, a model is evaluated on the test set.\n",
    "\n",
    "Let's plot a few of them in one collage to have a better look..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline\n",
    "\n",
    "\n",
    "def plot_image(image, interpol=\"nearest\", cmap=\"gray\", vminmax=[None,None], figsize=None):\n",
    "    # image: np.array of one of the following shapes:\n",
    "    #       grayscale image:    (height, width)\n",
    "    #       grayscale image:    (height, width, 1)\n",
    "    #       rgb image:          (height, width, 3)\n",
    "    print(\"Plotting image of shape: \", image.shape)\n",
    "    plt.figure(figsize=figsize) # (figsize=(n_imgs_per_row*0.5, n_rows*0.5)) # size (width, height), in inches.\n",
    "    if len(image.shape) == 2:\n",
    "        fig = plt.imshow(image, cmap=cmap, interpolation=interpol, vmin=vminmax[0], vmax=vminmax[1]) # imshow: (w,h) or (w,h,3)\n",
    "        plt.colorbar(fig)\n",
    "    elif len(image.shape) == 3 and image.shape[2] == 1:\n",
    "        fig = plt.imshow(image[:,:,0], cmap=cmap, interpolation=interpol, vmin=vminmax[0], vmax=vminmax[1]) # imshow: (w,h) or (w,h,3)\n",
    "        plt.colorbar(fig)\n",
    "    elif len(image.shape) == 3 and image.shape[2] == 3 :\n",
    "        _ = plt.imshow(image, interpolation=interpol)\n",
    "    else:\n",
    "        raise ValueError(\"Wrong shape of given image for plotting.\")\n",
    "\n",
    "\n",
    "def plot_grid_of_images(imgs, n_imgs_per_row=10, interpol=\"nearest\", cmap=\"gray\", vminmax=[None,None]):\n",
    "    # imgs: numpy array of one of the following shapes:\n",
    "    #       grayscales images:  (number-of-images, height, width)\n",
    "    #       grayscales images:  (number-of-images, height, width, 1)\n",
    "    #       color images:       (number-of-images, height, width, 3)\n",
    "    n_rows = imgs.shape[0]//n_imgs_per_row + 1*int(imgs.shape[0]%n_imgs_per_row > 0)\n",
    "    print(\"n_rows=\",n_rows)\n",
    "    # Append empty images if the last row is not complete\n",
    "    n_empty_imgs = n_rows * n_imgs_per_row - imgs.shape[0]\n",
    "    imgs_to_plot = np.concatenate( [imgs, np.zeros((n_empty_imgs, imgs.shape[1], imgs.shape[2]))], axis=0)\n",
    "    \n",
    "    # draw row by row\n",
    "    row_images = [] # each element will be (image-height, image-width X n_imgs_per_row)\n",
    "    for current_row in range(n_rows):\n",
    "        tmp_row_images = imgs_to_plot[current_row * n_imgs_per_row : (current_row + 1) * n_imgs_per_row]\n",
    "        row_images.append( np.concatenate(tmp_row_images, axis=1) )\n",
    "    # draw all row-images in one image\n",
    "    collage_of_images = np.concatenate(row_images, axis=0) # array.shape: (height X n_rows, width X n_imgs_per_row)\n",
    "    \n",
    "    plot_image(collage_of_images, interpol=interpol, cmap=cmap, vminmax=vminmax)\n",
    "\n",
    "\n",
    "plot_grid_of_images(train_imgs[0:100], n_imgs_per_row=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the intensities in the images take **values from 0 to 255**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preliminary: Data pre-processing\n",
    "\n",
    "A first step in almost all pipelines is to pre-process the data to make it more appropriate for the model.\n",
    "\n",
    "Below, we will perform 3 steps:  \n",
    "a) Change the labels from an integer representation to a **one-hot representation** of the **C=10 classes**.  \n",
    "b) Normalize the **intensities** in the images (from 0-255) to have **zero mean** and **unit standard deviation**.  \n",
    "c) **Vectorise the 2D images into 1D vectors for the MLP**, which takes one vector per sample as input. The CNN later will use the original 2D images.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a) Change representation of labels to one-hot vectors of length C=10.\n",
    "train_lbls_onehot = np.zeros(shape=(train_lbls.shape[0], C_classes ) )\n",
    "train_lbls_onehot[ np.arange(train_lbls_onehot.shape[0]), train_lbls ] = 1\n",
    "test_lbls_onehot = np.zeros(shape=(test_lbls.shape[0], C_classes ) )\n",
    "test_lbls_onehot[ np.arange(test_lbls_onehot.shape[0]), test_lbls ] = 1\n",
    "print(\"BEFORE: [train_lbls]        Type: \", type(train_lbls), \"|| Shape:\", train_lbls.shape, \" || Data type: \", train_lbls.dtype )\n",
    "print(\"AFTER : [train_lbls_onehot] Type: \", type(train_lbls_onehot), \"|| Shape:\", train_lbls_onehot.shape, \" || Data type: \", train_lbls_onehot.dtype )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# b) Normalize image intensities of the database to have mean=0 and std=1 across the whole database.\n",
    "# This commonly facilitates learning:\n",
    "# - A zero-centered signal with small magnitude helps avoid exploding/vanishing gradients.\n",
    "# - Features with arbitrary and heterogeneous ranges are harder for a model to learn.\n",
    "def normalize_int_whole_database(data):\n",
    "    # data: shape [num_samples, H, W] for grayscale (or [num_samples, H, W, C] for multi-channel).\n",
    "    mu = np.mean(data, axis=(0,1,2), keepdims=True) # Mean intensity over samples and pixels (per channel, if any).\n",
    "    std = np.std(data, axis=(0,1,2), keepdims=True) # Same shape as mu: [1, 1, 1] (or [1, 1, 1, C]).\n",
    "    norm_data = (data - mu) / std\n",
    "    return norm_data\n",
    "\n",
    "train_imgs = normalize_int_whole_database(train_imgs)\n",
    "test_imgs = normalize_int_whole_database(test_imgs)\n",
    "\n",
    "# Let's plot one image.\n",
    "index = 0 # Try any index from 0 to 59999.\n",
    "print(\"Plotting image of index: [\", index, \"]\")\n",
    "print(\"Class label for this image is: \", train_lbls[index])\n",
    "print(\"One-hot label representation: [\", train_lbls_onehot[index], \"]\")\n",
    "plot_image(train_imgs[index])\n",
    "# Notice the magnitude of intensities. Black is now negative and white is positive float."
   ]
  },
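  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (a toy sketch on made-up data, independent of the MNIST arrays above), whole-database normalization should leave the data with mean close to 0 and std close to 1:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(0)\n",
    "toy = rng.uniform(0, 255, size=(50, 4, 4))  # toy 'database' of 50 tiny 4x4 images\n",
    "norm = (toy - toy.mean()) / toy.std()       # same idea as normalize_int_whole_database\n",
    "print(abs(norm.mean()) < 1e-6, abs(norm.std() - 1.0) < 1e-6)\n",
    "```"
   ]
  },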
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# c) Flatten the images, from 2D matrices to 1D vectors. MLPs take feature-vectors as input, not 2D images.\n",
    "train_imgs_flat = train_imgs.reshape([train_imgs.shape[0], -1]) # Preserve 1st dim (N = num samples), flatten others.\n",
    "test_imgs_flat = test_imgs.reshape([test_imgs.shape[0], -1])\n",
    "print(\"Shape of numpy array holding the training database:\")\n",
    "print(\"Original : [S, H, W] = [\", train_imgs.shape , \"]\")\n",
    "print(\"Flattened: [S, H*W]  = [\", train_imgs_flat.shape , \"]\")"
   ]
  },
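  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see what reshape([N, -1]) does, here is a toy sketch with made-up shapes (not the MNIST arrays):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "imgs = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # pretend: 2 images of 3x4 pixels\n",
    "flat = imgs.reshape(imgs.shape[0], -1)        # keep dim 0 (samples), merge the rest\n",
    "print(imgs.shape, flat.shape)                 # (2, 3, 4) (2, 12)\n",
    "```\n",
    "\n",
    "Each row of the flattened array is one image's pixels, in row-major order."
   ]
  },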
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 1: Stochastic Gradient Descent for Training Neural Nets\n",
    "\n",
    "Below you are given the main training function, gradient_descent. This will be called by all following parts of the tutorial.  \n",
    "The function takes a model and data and performs iterations of gradient descent. Every few iterations, it also predicts the test data and reports accuracy.\n",
    "\n",
    "If the training database is large, processing the whole database per iteration can be very slow. An efficient alternative is **stochastic** gradient descent, where a training batch is randomly sampled per iteration.\n",
    "\n",
    "Below, change the code in get_random_batch() to make gradient descent stochastic, by sampling a **random** batch per iteration instead of always the same training samples.\n",
    "\n",
    "(*Task 1 can be completed independently from Task 2. Try both before and after, to observe the different behaviour.*)\n"
   ]
  },
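  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition only, here is a toy sketch on a made-up array (using np.random.choice here, rather than the rng.randint call the TODO below asks for): a stochastic batch is simply the dataset indexed by freshly drawn random indices at each iteration:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.RandomState(42)\n",
    "data = np.arange(10) * 10                # toy 'dataset' of 10 samples\n",
    "idx = rng.choice(data.shape[0], size=4)  # 4 random sample indices\n",
    "batch = data[idx]                        # a different batch on every draw\n",
    "print(batch.shape)                       # (4,)\n",
    "```"
   ]
  },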
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_train_progress(loss_l, acc_train_l, acc_test_l, iters_per_point, total_iters=None):\n",
    "\n",
    "    fig, axes = plt.subplots(1, 2, sharex=False, sharey=False)\n",
    "    assert len(loss_l) == len(acc_train_l) == len(acc_test_l)\n",
    "    x_points = range(0, len(loss_l)*iters_per_point, iters_per_point)\n",
    "    \n",
    "    axes[0].plot(x_points, loss_l, color=\"black\", label=\"Training loss\", linewidth=5)\n",
    "    axes[0].set_title(\"Training loss\", fontsize=10, y=1.022)\n",
    "    axes[0].yaxis.grid(True, zorder=0)\n",
    "    axes[0].set_xlabel('Iteration', fontsize=10)\n",
    "    if total_iters is not None:\n",
    "        axes[0].set_xlim([0,total_iters])\n",
    "    axes[0].set_ylim([0,None])\n",
    "    axes[0].legend(loc='upper right')\n",
    "    \n",
    "    axes[1].set_title(\"Accuracy\", fontsize=10, y=1.022)\n",
    "    axes[1].plot(x_points, acc_train_l, color=\"blue\", label=\"Train\", linewidth=5)\n",
    "    axes[1].plot(x_points, acc_test_l, color=\"red\", label=\"Test\", linewidth=5)\n",
    "    axes[1].yaxis.grid(True, zorder=0)\n",
    "    axes[1].set_xlabel('Iteration', fontsize=10)\n",
    "    if total_iters is not None:\n",
    "        axes[1].set_xlim([0,total_iters])\n",
    "    axes[1].set_ylim([0,100])\n",
    "    axes[1].legend(loc='lower right')\n",
    "    \n",
    "    plt.show()\n",
    "    \n",
    "    \n",
    "def get_random_batch(train_imgs, train_lbls, batch_size, rng):\n",
    "    # train_imgs: Images for training. Numpy array of shape [S, H, W]\n",
    "    # train_lbls: Labels of the training images. Numpy array [S], one integer for each of S samples.\n",
    "    # batch_size: integer. Size that the batch should have.\n",
    "    \n",
    "    ####### TODO: Sample a random batch of images for training. Fill in the blanks (???) ######### \n",
    "    #indices = rng.randint(low=??????, high=train_imgs.shape[???????], size=?????????, dtype='int32')\n",
    "    indices = range(0, batch_size)\n",
    "    ##############################################################################################\n",
    "    \n",
    "    train_imgs_batch = train_imgs[indices]\n",
    "    train_lbls_batch = train_lbls[indices]\n",
    "    return [train_imgs_batch, train_lbls_batch]\n",
    "\n",
    "def gradient_descent(net, loss_func, update_params_func, rng,\n",
    "                     train_imgs, train_lbls, test_imgs, test_lbls,\n",
    "                     batch_size, learning_rate, total_iters, iters_per_test=-1 ):\n",
    "    # net: Instance of a model. See classes: MLP_Numpy, MLP_Torch, MLP_Torch_Autograd, CNN_Torch_Autograd.\n",
    "    # loss_func: Function that computes the loss. See functions: cross_entropy_numpy/torch.\n",
    "    # update_params_func: Function performing SGD parameter updates. See: grad_descent_update_numpy/torch/autograd\n",
    "    # rng: numpy random number generator\n",
    "    # train_imgs: The training images. Numpy array, shape [S_tr, H, W]\n",
    "    # test_imgs: Same as above, for testing images. [S_te, H, W]\n",
    "    # train_lbls: One hot representation of labels corresponding to train_imgs. Numpy array, shape [S_tr, C]\n",
    "    # test_lbls: As above, but for testing data. [S_te, C]\n",
    "    # batch_size: Size N of the batch that should be processed per SGD iteration by a model.\n",
    "    # learning_rate: self-explanatory.\n",
    "    # total_iters: how many iterations to perform in total.\n",
    "    # iters_per_test: Integer. Every this many iterations, the model predicts the test data and accuracy is reported.\n",
    "    values_to_plot = {'loss':[], 'acc_train': [], 'acc_test': []}\n",
    "    \n",
    "    for t in range(total_iters):\n",
    "        # Sample batch for this SGD iteration\n",
    "        train_imgs_batch, train_lbls_batch = get_random_batch(train_imgs, train_lbls, batch_size, rng)\n",
    "        \n",
    "        # Forward pass\n",
    "        y_pred = net.forward_pass(train_imgs_batch)\n",
    "\n",
    "        # Compute loss: Cross Entropy\n",
    "        y_real = train_lbls_batch\n",
    "        loss = loss_func(y_pred, y_real)\n",
    "\n",
    "        # Backwards pass. Compute gradients.\n",
    "        grads = net.backward_pass(loss, y_real)\n",
    "\n",
    "        # Update weights with gradient descent\n",
    "        update_params_func(net.params, grads, learning_rate=learning_rate)\n",
    "        \n",
    "        \n",
    "        # ==== Report training loss and accuracy ======\n",
    "        # y_pred and loss can be either np.array, or torch.tensor (see later). If tensor, make it np.array.\n",
    "        y_pred_numpy = y_pred if type(y_pred) is np.ndarray else y_pred.detach().numpy()\n",
    "        y_pred_lbls = np.argmax(y_pred_numpy, axis=1) # y_pred is soft/probability. Make it a hard one-hot label.\n",
    "        y_real_lbls = np.argmax(y_real, axis=1)\n",
    "        \n",
    "        acc_train = np.mean(y_pred_lbls == y_real_lbls) * 100. # percentage\n",
    "        \n",
    "        loss_numpy = loss if isinstance(loss, float) else loss.item()\n",
    "        print(\"[iter:\", t, \"]: Training Loss: {0:.2f}\".format(loss_numpy), \"\\t Accuracy: {0:.2f}\".format(acc_train))\n",
    "        \n",
    "        # =============== Every few iterations, predict the testing-database ================#\n",
    "        if t==total_iters-1 or t%iters_per_test == 0:\n",
    "            y_pred_test = net.forward_pass(test_imgs)\n",
    "            # ==== Report test accuracy ======\n",
    "            y_pred_test_numpy = y_pred_test if type(y_pred_test) is np.ndarray else y_pred_test.detach().numpy()\n",
    "            y_pred_lbls_test = np.argmax(y_pred_test_numpy, axis=1)\n",
    "            y_real_lbls_test = np.argmax(test_lbls, axis=1)\n",
    "            acc_test = np.mean(y_pred_lbls_test == y_real_lbls_test) * 100.\n",
    "            print(\"\\t\\t\\t\\t\\t\\t\\t\\t Testing Accuracy: {0:.2f}\".format(acc_test))\n",
    "            \n",
    "            # Keep list of metrics to plot progress.\n",
    "            values_to_plot['loss'].append(loss_numpy)\n",
    "            values_to_plot['acc_train'].append(acc_train)\n",
    "            values_to_plot['acc_test'].append(acc_test)\n",
    "    # In the end of the process, plot loss accuracy on training and testing data.\n",
    "    plot_train_progress(values_to_plot['loss'], values_to_plot['acc_train'], values_to_plot['acc_test'], iters_per_test)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 2: Classification with Multi-Layer Perceptron (MLP) in Numpy\n",
    "\n",
    "We will use Numpy to build a *Multi-Layer Perceptron* for classification, i.e. a neural network (NN) with hidden layer(s).  \n",
    "In the previous tutorial we used *logistic regression*. An MLP is very similar, but additionally contains at least one hidden layer.\n",
    "Let's build this in Numpy. Please consult the previous tutorial on logistic regression.\n",
    "\n",
    "*(If you get stuck implementing the backward pass in Task 2 or Task 3, you can continue straight with Task 4.)*\n",
    "\n",
    "**Hint:** To debug all the below sections, it is very useful to print the shapes of arrays/tensors (with `print(tensor_name.shape)`). Follow how they change layer after layer, to double-check that behaviour is as expected."
   ]
  },
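  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a reminder, the model and loss that the skeleton below implements are (with the bias folded in by appending a constant feature of 1 to the input of each fully-connected layer):\n",
    "\n",
    "$$ h_1 = x\\, w_1, \\qquad a_1 = \\max(0, h_1), \\qquad \\text{logits} = [a_1, 1]\\, w_2, \\qquad \\hat{y} = \\mathrm{softmax}(\\text{logits}) $$\n",
    "\n",
    "$$ L = -\\frac{1}{N} \\sum_{i=1}^{N} \\sum_{c=1}^{C} y_{ic} \\log(\\hat{y}_{ic} + \\epsilon) $$"
   ]
  },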
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-   \n",
    "import numpy as np\n",
    " \n",
    "class MLP_Numpy():\n",
    "    def __init__(self, rng):\n",
    "        # Construct and initialize network parameters\n",
    "        D_in = H_height*W_width # Dimension of input feature-vectors. Length of a vectorised image.\n",
    "        D_hid = 1000 # Dimension of Hidden layer.\n",
    "        D_out = C_classes # Dimension of Output layer. Number of classes \n",
    "        \n",
    "        ###### TODO: Initialize parameters of MLP by sampling from N(0,0.01). Fill the blanks #####\n",
    "        # Also see forward_pass(), to see how they are used. It will help you infer correct shapes.\n",
    "        w1 = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, ?????)) # D_in+1 because we add bias in forward pass\n",
    "        w2 = rng.normal(loc=0.0, scale=0.01, size=(??????, D_out)) # Params of 2nd layer\n",
    "        self.params = [w1, w2]\n",
    "        #################################################################################\n",
    "        \n",
    "    def forward_pass(self, batch_imgs):\n",
    "        # This function applies the model to the input. It essentially defines the model's architecture.\n",
    "        # Input: batch_imgs: Input batch. Numpy.array of [N=number of samples, H=height, W=width] floats.\n",
    "        # Returns: y_pred: array of [N, D_out] floats. y_pred[i] contains class posterior (probs) for sample i.\n",
    "        [w1, w2] = self.params\n",
    "        \n",
    "        unary_feature_for_bias = np.ones(shape=(batch_imgs.shape[0], 1)) # [N, 1] column vector.\n",
    "        x = np.concatenate((batch_imgs, unary_feature_for_bias), axis=1) # Extra feature=1 for bias.\n",
    "        \n",
    "        ################ TODO: Fill the blanks (???) ##############################################\n",
    "        # Build network of form: x -> fully connected -> relu -> fully connected -> softmax \n",
    "        # Hidden layer\n",
    "        h1 = x.dot(?????)\n",
    "        h1_relu = np.maximum(????, ????)\n",
    "        # Fully-connected classifier.\n",
    "        h1_ext = np.concatenate((?????, unary_feature_for_bias), axis=1) # Extend for bias. Output shape: [N, D_hid+1]\n",
    "        h2 = ????.????(w2)\n",
    "        ????? = h2\n",
    "        ############################################################################################\n",
    "        \n",
    "        # Softmax activation function.\n",
    "        exp_logits = np.exp(logits)\n",
    "        y_pred = exp_logits / np.sum(exp_logits, axis=1, keepdims=True) \n",
    "        # sum with keepdims=True returns an [N,1] array. It would be [N] if keepdims=False.\n",
    "        # Numpy broadcasts [N,1] to [N,D_out] via repetition, to divide exp_logits (which is [N,D_out]) elementwise.\n",
    "        \n",
    "        # Store activations. Will be needed for backprop.\n",
    "        self.__activations = [x, h1, h1_relu, h1_ext, h2, y_pred]\n",
    "        \n",
    "        return y_pred\n",
    "        \n",
    "        \n",
    "    def backward_pass(self, loss, y_real): # NOTE: Loss value is not directly needed for computing grads in closed form.\n",
    "        # Performs back-propagation in closed form.\n",
    "        # Computes the gradient of the loss with respect to trainable parameters.\n",
    "        # y_real: Array [N, D_out]. y_real[i] is one-hot representation of real training label for i-th sample.\n",
    "        # returns grads: list of arrays, one for each trainable parameter. \n",
    "        \n",
    "        [x, h1, h1_relu, h1_ext, h2, y_pred] = self.__activations\n",
    "        [w1, w2] = self.params\n",
    "        \n",
    "        N = y_pred.shape[0]\n",
    "        D_out = y_pred.shape[1]\n",
    "        \n",
    "        # Derivative of cross entropy w.r.t. the input to the softmax (derived directly, in closed form).\n",
    "        # Difficult derivation. Nice explanation at: https://deepnotes.io/softmax-crossentropy\n",
    "        grad_logits = (y_pred - y_real) / N \n",
    "        \n",
    "        ##### TODO: Fill in the blanks (???) to complete the back-prop for the 2-layers MLP ###########\n",
    "        # Consult Tutorial 1 to compare with Logistic Regression.\n",
    "        grad_w2 = h1_ext.transpose().dot(?????) # shape: [D_hid+1, D_out]\n",
    "        grad_h1_ext = grad_logits.dot(w2.transpose()) # shape: [N, D_hid+1]\n",
    "        grad_h1_relu = grad_h1_ext[:, :-1] # Drop the extra feature dimension for the bias [N, D_hid]\n",
    "        grad_h1 = grad_h1_relu.copy() # Derivatives of activation functions.\n",
    "        grad_h1[ ????? < 0] = 0 # Derivatives of activation functions.\n",
    "        grad_w1 = ?????.transpose().dot(grad_h1) # Shape [D_in+1, D_hid]\n",
    "        ###############################################################################################\n",
    "        \n",
    "        # Elements of below list must be in same order as in self.params.\n",
    "        # Each of the subarrays has the same shape as that of corresponding trainable parameter in self.params.\n",
    "        grads = [grad_w1, grad_w2]\n",
    "        return grads\n",
    "        \n",
    "        \n",
    "def cross_entropy_numpy(y_pred, y_real, eps=1e-7):\n",
    "    # Cross entropy\n",
    "    # y_pred: Predicted class-posterior probabilities, returned by forward_pass. Matrix of shape [N, D_out]\n",
    "    # y_real: One-hot representation of real training labels. Same shape as y_pred.\n",
    "    \n",
    "    ################# TODO: Complete the calculation of cross-entropy for each sample ################\n",
    "    x_entr_per_sample = - np.????( ???? * np.????( ????? + eps), axis=1) # Sum over classes, axis=1\n",
    "    ##################################################################################################\n",
    "    \n",
    "    loss = np.mean(x_entr_per_sample, axis=0) # Expectation of loss: Mean over samples (axis=0).\n",
    "    return loss\n",
    "\n",
    "\n",
    "def grad_descent_update_numpy(params, grads, learning_rate):\n",
    "    # params: list of trainable parameters, created in initializer of the model.\n",
    "    # grads: list of gradients, returned by backward pass of model. grads[i] corresponds to params[i]\n",
    "    # learning_rate: a float.\n",
    "    assert len(params)==len(grads)\n",
    "    for i in range(len(params)):\n",
    "        ##### TODO: Perform the updates of the parameters by filling the blanks (????) ###########\n",
    "        params[i] = ????? - ????? * grads[i]\n",
    "        ##########################################################################################\n",
    "    \n",
    "# Create the network\n",
    "SEED = 42\n",
    "rng = np.random.RandomState(seed=SEED) # Random number generator\n",
    "net = MLP_Numpy(rng=rng)\n",
    "# Start training\n",
    "gradient_descent(net,\n",
    "                 cross_entropy_numpy, # We are using the Numpy version we wrote above.\n",
    "                 grad_descent_update_numpy, # We are using the Numpy version we wrote above.\n",
    "                 rng,\n",
    "                 train_imgs_flat,\n",
    "                 train_lbls_onehot,\n",
    "                 test_imgs_flat,\n",
    "                 test_lbls_onehot,\n",
    "                 batch_size=40, \n",
    "                 learning_rate=1e-2,\n",
    "                 total_iters=400,\n",
    "                 iters_per_test=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you build the MLP correctly, the above should result in a report that shows the model is learning, with training *loss* decreasing towards zero, while training and testing *accuracy* increasing.\n",
    "\n",
    "When process finishes (400 iterations), metrics will be plotted in the bottom of the report (**scroll down if needed**).\n",
    "\n",
    "If you have *not* changed gradient_descent() in Task 1, you will see the model overfitting, with training accuracy reaching 100\\%, while testing accuracy finishes at 59.57\\%.\n",
    "\n",
    "**Q: Why the overfit, if Task 1 has not been addressed?**  \n",
    "\n",
    "If you have made gradient descent *stochastic* (random), overfitting will not occur with this small network. But you will notice instability in the plotted metrics. Training and testing accuracy should finish at 87.50\\% and 88.15\\% respectively.\n",
    "\n",
    "**Q: What factors cause the instability in the observed metrics, if Task1 has been addressed?**  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Task 3: MLP in PyTorch\n",
    "\n",
    "We now introduce a library built for development of deep neural networks, PyTorch. It will be used for the rest of the tutorials and the coursework.  \n",
    "So, it is very important to understand how it works.\n",
    "\n",
    "In the below we will write *exactly the same* MLP as in Task 2, but translate it to PyTorch. The changes you should do correspond to those in Task 2, just different (very similar) language. This will show you how similar it is to write Numpy and Pytorch."
   ]
  },
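  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the correspondence (nothing to fill in here; a standalone sketch, not part of the tutorial's code), compare a few Numpy operations used in Task 2 with their PyTorch counterparts:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "a_np = np.ones((2, 3))\n",
    "b_np = np.ones((3, 4))\n",
    "c_np = a_np.dot(b_np)          # Numpy matrix multiply. Shape [2, 4].\n",
    "relu_np = np.maximum(c_np, 0)  # ReLU in Numpy.\n",
    "\n",
    "a_t = torch.ones(2, 3)\n",
    "b_t = torch.ones(3, 4)\n",
    "c_t = a_t.mm(b_t)              # PyTorch equivalent of .dot(). Shape [2, 4].\n",
    "relu_t = c_t.clamp(min=0)      # ReLU in PyTorch.\n",
    "```"
   ]
  },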
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import torch\n",
    "\n",
    "\n",
    "class MLP_Torch():\n",
    "    def __init__(self, rng):\n",
    "        # Construct and initialize network parameters\n",
    "        D_in = H_height*W_width # Dimension of input feature-vectors. Length of a vectorised image.\n",
    "        D_hid = 1000 # Dimension of Hidden layer.\n",
    "        D_out = C_classes # Dimension of Output layer. Number of classes \n",
    "        \n",
    "        ######## TODO: Initialize parameters of the MLP by sampling them from N(0,0.01). ###########\n",
    "        # For Pytorch, we first make a numpy array with the *initial* weights...\n",
    "        w1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, ?????)) # Similar to Task1\n",
    "        w2_init = rng.normal(loc=0.0, scale=0.01, size=(??????, ?????)) # Similar to Task1\n",
    "        # ... and then we initialize with them the actual *tensor*. This one will change with training.\n",
    "        w1 = torch.tensor(w1_init, dtype=torch.float)\n",
    "        w2 = torch.tensor(w2_init, dtype=torch.float)\n",
    "        self.params = [w1, w2]\n",
    "        ##########################################################################################\n",
    "        \n",
    "    def forward_pass(self, batch_imgs):\n",
    "        # compute predicted y\n",
    "        [w1, w2] = self.params\n",
    "        \n",
    "        batch_imgs_t = torch.tensor(batch_imgs, dtype=torch.float)\n",
    "        \n",
    "        unary_feature_for_bias = torch.ones(size=(batch_imgs.shape[0], 1)) # [N, 1] column vector.\n",
    "        x = torch.cat((batch_imgs_t, unary_feature_for_bias), dim=1) # Extra feature=1 for bias.\n",
    "        \n",
    "        ################ TODO: Fill the blanks (???) ##############################################\n",
    "        # Build network of form: x -> fully connected -> relu -> fully connected -> softmax \n",
    "        # Hidden layer\n",
    "        h1 = x.mm(?????)\n",
    "        h1_relu = ????.clamp(min=????)\n",
    "        # Fully-connected classifier (aka Densely connected classifier)\n",
    "        h1_ext = torch.cat((?????, unary_feature_for_bias), dim=????)\n",
    "        h2 = ????.????(w2)\n",
    "        ????? = h2\n",
    "        ######################################################################################\n",
    "        \n",
    "        # Softmax activation function.\n",
    "        exp_logits = torch.exp(logits)\n",
    "        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) \n",
    "        # sum with Keepdim=True returns [N,1] array. It would be [N] if keepdim=False.\n",
    "        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_h2 (which is [N,D_out]).\n",
    "\n",
    "        # Store activations. Will be needed for backprop.\n",
    "        self.__activations = [x, h1, h1_relu, h1_ext, h2, y_pred]\n",
    "        \n",
    "        return y_pred        \n",
    "        \n",
    "        \n",
    "    def backward_pass(self, loss, y_real): # NOTE: It does not require the loss. Grads computed closed form.\n",
    "        # Performs back-propagation in closed form.\n",
    "        # Computes the gradient of the loss with respect to trainable parameters.\n",
    "        # y_real: Array [N, D_out]. y_real[i] is one-hot representation of real training label for i-th sample.\n",
    "        # returns grads: list of arrays, one for each trainable parameter. \n",
    "        \n",
    "        [x, h1, h1_relu, h1_ext, h2, y_pred] = self.__activations\n",
    "        [w1, w2] = self.params\n",
    "        \n",
    "        y_real = torch.tensor(y_real, dtype=torch.float) if type(y_real) is np.ndarray else y_real\n",
    "        \n",
    "        N = y_pred.shape[0]\n",
    "        D_out = y_pred.shape[1]\n",
    "        \n",
    "        # Backprop from loss all the way back, to compute gradients of trainable parameters with respect to loss.\n",
    "        \n",
    "        # Derivative of cross entropy wrt input to softmax (straight, without intermediate computes)\n",
    "        # Difficult derivation. Nice explanation at: https://deepnotes.io/softmax-crossentropy\n",
    "        grad_logits = (y_pred - y_real) / N \n",
    "        \n",
    "        ##### TODO: Fill in the blanks (???) to complete the back-prop for the 2-layers MLP ###########\n",
    "        # Compare with Numpy, it may help.\n",
    "        grad_w2 = h1_ext.t().mm(?????) # shape: [D_hid+1, D_out]\n",
    "        grad_h1_ext = grad_logits.mm(????.????()) # shape: [N, D_hid+1]. Compare with Numpy, it may help.\n",
    "        grad_h1_relu = grad_h1_ext[:, :-1] # Drop the extra feature dimension for the bias [N, D_hid]\n",
    "        grad_h1 = grad_h1_relu.clone()\n",
    "        grad_h1[????? < 0] = 0 # Derivatives of activation functions.\n",
    "        grad_w1 = ????.t().mm(grad_h1)\n",
    "        ################################################################################################\n",
    "        \n",
    "        # Elements of below list must be in same order as in self.params.\n",
    "        # Each of the subarrays has the same shape as that of corresponding trainable parameter in self.params.\n",
    "        grads = [grad_w1, grad_w2]\n",
    "        return grads\n",
    "        \n",
    "\n",
    "# Same used by Autograd and CNN\n",
    "def cross_entropy_torch(y_pred, y_real, eps=1e-7):\n",
    "    # Cross entropy\n",
    "    # y_pred: Predicted class-posterior probabilities, returned by forward_pass. Numpy array of shape [N, D_out]\n",
    "    # y_real: One-hot representation of real training labels. Same shape as y_pred.\n",
    "    \n",
    "    # If number array is given, change it to a Torch tensor.\n",
    "    y_pred = torch.tensor(y_pred, dtype=torch.float) if type(y_pred) is np.ndarray else y_pred\n",
    "    y_real = torch.tensor(y_real, dtype=torch.float) if type(y_real) is np.ndarray else y_real\n",
    "    \n",
    "    ######## TODO: (similar to numpy) Complete the calculation of cross-entropy for each sample ###########\n",
    "    x_entr_per_sample = - torch.????( ???? * torch.????( ???? + eps ), dim=1) # Sum over classes, axis=1\n",
    "    #######################################################################################################\n",
    "    \n",
    "    loss = torch.mean(x_entr_per_sample, dim=0) # Expectation of loss: Mean over samples (axis=0).\n",
    "    return loss\n",
    "\n",
    "\n",
    "def grad_descent_update_torch(params, grads, learning_rate):\n",
    "    # params: list of trainable parameters, created in initializer of the model.\n",
    "    # grads: list of gradients, returned by backward pass of model. grads[i] corresponds to params[i]\n",
    "    # learning_rate: a float.\n",
    "    assert len(params)==len(grads)\n",
    "    for i in range(len(params)):\n",
    "        ############## TODO: Perform the updates of the parameters ###############################\n",
    "        # Similar to Numpy, but for torch tensors you should use the param.data and grad.data for assignments.\n",
    "        params[i].data = ??????.data - ??????? * grads[i].data\n",
    "        ##########################################################################################\n",
    "        \n",
    "# Create the network\n",
    "rng = np.random.RandomState(seed=SEED) # Random number generator\n",
    "net = MLP_Torch(rng=rng)\n",
    "# Start training\n",
    "gradient_descent(net,\n",
    "                 cross_entropy_torch, # We are using the Torch version we wrote above.\n",
    "                 grad_descent_update_torch, # We are using the Torch version we wrote above.\n",
    "                 rng,\n",
    "                 train_imgs_flat,\n",
    "                 train_lbls_onehot,\n",
    "                 test_imgs_flat,\n",
    "                 test_lbls_onehot,\n",
    "                 batch_size=40, \n",
    "                 learning_rate=1e-2,\n",
    "                 total_iters=400,\n",
    "                 iters_per_test=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If the correct changes are done, you should get *exactly* the same results as in Task 2.\n",
    "\n",
    "**For more material on transitioning from Numpy to PyTorch, you can check this tutorial: https://pytorch.org/tutorials/beginner/pytorch_with_examples.html**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 4: Auto-differentiation (aka Autograd in PyTorch)\n",
    "\n",
    "Hopefully Task 3 convinced you it's similar writing in Numpy and PyTorch. But then, why bother with PyTorch?\n",
    "\n",
    "Because it offers auto-differentiation (called autograd in PyTorch). In Task 2 we wrote manually the back-propagation. For this, we need to derive the derivative of each of a network's component, and implement it. It can get extremely tedious for larger nets. Libraries build for Deep Learning (like PyTorch, Tensorflow, Theano, Chainer, etc), offer auto-differentiation: When the forward pass is implemented, the framework keeps track of the operations performed on the tensors. The backend has already implementations of each operation's derivative. To do the backprop, the library's back end will take care of applying the chain-rule automatically.\n",
    "\n",
    "Below, we implement a new MLP class, inheritting from Task 3 class, and make only the minor changes required to use Autograd in the backwards pass..."
   ]
  },
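  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal standalone sketch of what Autograd does (separate from the MLP code below; the variable names here are just for illustration):\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "w = torch.tensor([2.0, 3.0], requires_grad=True)  # Track operations on w.\n",
    "loss = (w * w).sum()                              # loss = w[0]^2 + w[1]^2\n",
    "loss.backward()                                   # Autograd applies the chain rule.\n",
    "grad_before = w.grad.clone()                      # d(loss)/dw = 2*w -> tensor([4., 6.])\n",
    "w.grad.zero_()                                    # Zero grads, else backward() accumulates them.\n",
    "```"
   ]
  },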
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import torch\n",
    "\n",
    "class MLP_Torch_Autograd(MLP_Torch):\n",
    "    def __init__(self, rng):\n",
    "        # Construct and initialize network parameters\n",
    "        D_in = H_height*W_width # Dimension of input feature-vectors. Length of a vectorised image.\n",
    "        D_hid = 1000 # Dimension of Hidden layer.\n",
    "        D_out = C_classes # Dimension of Output layer. Number of classes \n",
    "        \n",
    "        ######## TODO: Add a hidden layer, to change this perceptron to an MLP ##########\n",
    "        # EXACTLY as above for MLP_Pytorch (copy-paste)...\n",
    "        w1_init = rng.normal(loc=0.0, scale=0.01, size=(D_in+1, ????)) # Similar to task 1/2\n",
    "        w2_init = rng.normal(loc=0.0, scale=0.01, size=(?????, ????)) # Similar to task 1/2\n",
    "        # ... EXCEPT for passing one more argument!\n",
    "        # requires_grad=True tells Torch that it will later need to automatically compute grads for this!\n",
    "        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=????) # <------- Important. Set it!\n",
    "        w2 = torch.tensor(w2_init, dtype=torch.float, requires_grad=????) # <------- Important Set it!\n",
    "        self.params = [w1, w2]\n",
    "        ###########################################################################\n",
    "        \n",
    "        \n",
    "    def forward_pass(self, batch_imgs):\n",
    "        return MLP_Torch.forward_pass(self, batch_imgs) # Calls parent's (MLP_Torch). No change.\n",
    "        \n",
    "    def backward_pass(self, loss, y_real):\n",
    "        ######### TODO: Return automatically computed gradients in a list (ala task1/2) ################\n",
    "        # We no longer need to write down the backward pass, for parameters were requires_grads=True, \n",
    "        # Calling loss.backward(), torch's Autograd automatically computes grads of loss wrt each parameter p,...\n",
    "        # ... and puts them in p.grad. Return them in a list.\n",
    "        loss.backward()\n",
    "        grads = [param.????? for param in self.params]\n",
    "        return grads\n",
    "        ################################################################################################\n",
    "        \n",
    "\n",
    "def grad_descent_update_autograd(params, grads, learning_rate):\n",
    "    # params: Type: Tensor. List of trainable parameters, created in initializer of the model. \n",
    "    # grads:  Type: Tensor. List of gradients, returned by backward pass. grads[i] corresponds to params[i].\n",
    "    # learning_rate: a float.\n",
    "    assert len(params)==len(grads)\n",
    "    for i in range(len(params)):\n",
    "        \n",
    "        ############## TODO: Perform the updates of the parameters ###############################\n",
    "        # Same as function grad_descent_update_torch() previously (copy paste).\n",
    "        ?????????.data -= ??????????? * ?????????.data\n",
    "        # IMPORTANT: Weirdness of Torch's Autograd: Need to manually set gradients to zero at end of iteration.\n",
    "        # ... otherwise loss.backward() keeps accumulating them, and you would get wrong results.\n",
    "        ?????[i].zero_() # *** IMPORTANT. REMEMBER THIS FOR FUTURE & COURSEWORK! ***\n",
    "        ##########################################################################################\n",
    "\n",
    "# Create the network\n",
    "rng = np.random.RandomState(seed=SEED) # Random number generator\n",
    "net = MLP_Torch_Autograd(rng=rng)\n",
    "# Start training\n",
    "gradient_descent(net,\n",
    "                 cross_entropy_torch, # Same as before.\n",
    "                 grad_descent_update_autograd, # Use the above version, with auto-differentiation.\n",
    "                 rng,\n",
    "                 train_imgs_flat,\n",
    "                 train_lbls_onehot,\n",
    "                 test_imgs_flat,\n",
    "                 test_lbls_onehot,\n",
    "                 batch_size=40, \n",
    "                 learning_rate=1e-2,\n",
    "                 total_iters=400,\n",
    "                 iters_per_test=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If completed correctly, the results should be *exactly* the same as in Task 2 and 3.  \n",
    "**Please note** the implementation detail of zeroing the gradients in the end of function grad_descent_update_autograd(), needed when Autograd is used.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Task 5: Building a basic Convolutional Neural Network in PyTorch\n",
    "\n",
    "The predominant neural networks used in computer vision are Convolutional Neural Networks (CNNs). Their most popular components are convolutions, for processing information and deriving more complex features, and pooling layers for aggregating & downsampling the information.\n",
    "\n",
    "We will below familiarize ourselves with convolutions and pooling in PyTorch. We will change the 2-layers MLP of the previous tasks to a CNN of the form:  \n",
    "` x -> conv -> bias -> pool -> flatten -> bias -> fully connected layer -> y_pred`  \n",
    "\n",
    "Aims:\n",
    "- Understand the corresponding functions of PyTorch.\n",
    "- Understand how each convolutional/pooling layer changes the dimensions of the input feature maps.  \n",
    "\n",
    "These points are important for building more complex models in later tutorials/courseworks, so spend some time experimenting."
   ]
  },
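  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feel for how these operations change tensor shapes, you can run a standalone sketch like the following (the 32x32 input size here is arbitrary, chosen only for illustration, and is not the tutorial's data):\n",
    "\n",
    "```python\n",
    "import torch\n",
    "from torch.nn import functional as F\n",
    "\n",
    "x = torch.zeros(1, 1, 32, 32)                           # [N, channels, H, W]\n",
    "k = torch.zeros(8, 1, 5, 5)                             # 8 kernels of size [5x5]\n",
    "h = F.conv2d(x, k, stride=1, padding=0)                 # [1, 8, 28, 28]: H,W shrink by kernel_size-1\n",
    "p = F.max_pool2d(h, kernel_size=[2, 2], stride=[2, 2])  # [1, 8, 14, 14]: H,W halved\n",
    "print(h.shape, p.shape)\n",
    "```"
   ]
  },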
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import torch\n",
    "from torch.nn import functional as F\n",
    "\n",
    "\"\"\"\n",
    "We provide the main API of the conv2d and max_pool2d operations here for convenience:\n",
    "\n",
    "conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1)\n",
    "    input: input tensor of shape: (minibatch, in_channels, Height, Width)\n",
    "    weight: filters of shape: (out_channels, in_channels, kernel's Height, kernel's Width)\n",
    "    bias: optional bias tensor of shape: (out_channels).\n",
    "    stride: stride of the convolving kernel. single scalar or (sH, sW).\n",
    "    padding: zero paddings on both sides of input. Single number or tuple (padH, padW).\n",
    "    dilation: dilation rate for dilated convolutions (though we don't discuss it here).\n",
    "    \n",
    "For docs, search for *conv2d* at: https://pytorch.org/docs/stable/_modules/torch/nn/functional.html\n",
    "... or write create a new cell in this notebook, and after the above import, write and run: F.conv2d?\n",
    "\n",
    "F.max_pool2d(input, kernel_size, stride=None, padding=0, ceil_mode=False)\n",
    "    input: input tensor of shape: (minibatch, in_channels, Height, Width)\n",
    "    kernel_size: size of the pooling kernel. (Kernel's Height, kernel's Width)\n",
    "    stride: stride of the pooling kernel as it convolves the input. single scalar or (stride-H, stride-W).\n",
    "    padding: zero paddings on both sides of input. Single number or tuple (padH, padW).\n",
    "    ceil_mode: If True, will use `ceil` instead of `floor` to compute output shape when kernel partly runs out of input borders.\n",
    "\n",
    "For docs, search for *max_pool2d* at: https://pytorch.org/docs/stable/_modules/torch/nn/functional.html\n",
    "... or write create a new cell in this notebook, and after the above import, write and run: F.max_pool2d?\n",
    "\"\"\"\n",
    "\n",
    "class CNN_Torch_Autograd():\n",
    "    def __init__(self, rng):\n",
    "        # Construct and initialize network parameters, to build a CNN of the form:\n",
    "        # x -> conv[5x5] -> bias -> maxpool k[2x2],s[2,2] -> flatten -> bias -> fully conn. layer -> y_pred\n",
    "        D_in = 1 # Channels of input image.\n",
    "        D_hid = 10\n",
    "        ######### QUESTION: How is the below number calculated for this architecture? ###################\n",
    "        # D_in_fc is the length of hidden feature vector per sample, given as input to last fully conn. layer.\n",
    "        # Why this number? It's architecture dependent.\n",
    "        # To answer, check the rest of this function and forward_pass() to understand the architecture.\n",
    "        # Consider how a conv and a pool change the shape of the input. Print shapes in forward_pass to investigate.\n",
    "        D_in_fc = 1440\n",
    "        #################################################################################################\n",
    "        D_out = C_classes # Dimension of Output layer. Number of classes \n",
    "        \n",
    "        ######## TODO: Initialize [5x5] convolutional kernels and biases ############\n",
    "        # Initialize conv kernels and biases for CNN of form:\n",
    "        # x -> conv[5x5] -> bias -> maxpool k[2x2],s[2,2] -> flatten -> bias -> fully conn. layer -> y_pred\n",
    "        # Consult the forward pass, to get an idea about the architecture, to help filling this section.\n",
    "        # Notice: biases are implemented differently in CNNs than MLPs....\n",
    "        # ... They are implemented by adding a different *trainable* bias to each channel (feature dimension).\n",
    "        w1_init = rng.normal(loc=0.0, scale=0.01, size=(?????, D_in, 5, ?????)) # See Conv's API for expected shape.\n",
    "        b1_init = rng.normal(loc=0.0, scale=0.01, size=(D_hid)) # One bias per channel of hidden layer\n",
    "        w2_init = rng.normal(loc=0.0, scale=0.01, size=(D_in_fc, ?????)) # Classification layer, similar to MLP.\n",
    "        b2_init = rng.normal(loc=0.0, scale=0.01, size=(?????)) # One bias per channel of hidden layer\n",
    "        w1 = torch.tensor(w1_init, dtype=torch.float, requires_grad=True)\n",
    "        b1 = torch.tensor(b1_init, dtype=torch.float, requires_grad=True)\n",
    "        w2 = torch.tensor(w2_init, dtype=torch.float, requires_grad=True)\n",
    "        b2 = torch.tensor(b2_init, dtype=torch.float, requires_grad=True)\n",
    "        self.params = [w1, b1, w2, b2]\n",
    "        #############################################################################################\n",
    "        \n",
    "        \n",
    "    def forward_pass(self, batch_imgs):\n",
    "        # compute predicted y\n",
    "        [w1, b1, w2, b2] = self.params\n",
    "        \n",
    "        x = torch.tensor(batch_imgs, dtype=torch.float)\n",
    "        \n",
    "        ####################### TODO: Fill in the blanks to create a basic CNN ######################\n",
    "        # Make CNN: x -> conv[5x5] -> bias -> maxpool k[2x2],s[2,2] -> flatten -> bias -> fully conn. layer -> y_pred\n",
    "        # Hidden layer\n",
    "        print(\"[x] shape: \", x.shape)\n",
    "        h1 = F.conv2d(input=??????, ?????=w1, bias=None, stride=1, padding=0, dilation=1)\n",
    "        print(\"[h1] shape: \", h1.shape)\n",
    "        b1_resh = b1.reshape([1, b1.shape[0], 1, 1]) # Add unary dims, so that afterwards it can get broadcasted\n",
    "        h1_bias = ?????? + b1_resh #Unary dims (N,-,W,H) get broadcasted for the addition.\n",
    "        h1_relu = ??????.clamp(min=0)\n",
    "        print(\"[h1_relu] shape: \", h1_relu.shape)\n",
    "        h1_pool = F.max_pool2d(????, kernel_size=[2,2], stride=[2,2], padding=0, ceil_mode=True)\n",
    "        print(\"[h1_pool] shape: \", h1_pool.shape)\n",
    "        # Fully-connected classifier (aka Densely connected classifier)\n",
    "        h1_relu_flat = h1_pool.reshape( [h1_pool.shape[0], -1] ) # Flatten activations, to pass to fully conn layer.\n",
    "        print(\"[h1_relu_flat] shape: \", h1_relu_flat.shape)\n",
    "        h2 = h1_relu_flat.mm(??????)\n",
    "        logits = ????? + ??????.reshape([1, b2.shape[0]]) # Add bias to activation, after reshape for broadcasting.\n",
    "        print(\"[logits] shape: \", logits.shape)\n",
    "        ##############################################################################################\n",
    "        \n",
    "        # Softmax activation function\n",
    "        exp_logits = torch.exp(logits)\n",
    "        y_pred = exp_logits / torch.sum(exp_logits, dim=1, keepdim=True) \n",
    "        # sum with Keepdim=True returns [N,1] array. It would be [N] if keepdim=False.\n",
    "        # Torch broadcasts [N,1] to [N,D_out] via repetition, to divide elementwise exp_h2 (which is [N,D_out]).\n",
    "        \n",
    "        return y_pred\n",
    "    \n",
    "    def backward_pass(self, loss, y_real):\n",
    "        # Same as MLP in autograd.\n",
    "        loss.backward()\n",
    "        grads = [param.grad for param in self.params]\n",
    "        return grads\n",
    "        \n",
    "        \n",
    "########################## IMPORTANT NOTE (nothing to do) ###############################################\n",
    "# The MLP receives each sample as a vector. Thus we had vectorized/flattened the 2D images prior to Task 1.\n",
    "# The CNN processes 2D inputs directly. Vectorizing them is not needed.\n",
    "# BUT PyTorch receives input to conv/pool of Torch needs to be of shape [N, Channels, H, W], with...\n",
    "# ... second dimension being the number of input channels (number of input feature maps if hidden layer).\n",
    "# For grayscale images, there is only 1 image channel. For RGB images, this would be 3. We add this dimension:\n",
    "IMAGE_CHANNELS = 1\n",
    "train_imgs_cnn = train_imgs.reshape([train_imgs.shape[0], IMAGE_CHANNELS, train_imgs.shape[1], train_imgs.shape[2]])\n",
    "test_imgs_cnn = test_imgs.reshape([test_imgs.shape[0], IMAGE_CHANNELS, test_imgs.shape[1], test_imgs.shape[2]])\n",
    "#######################################################################################################\n",
    "\n",
    "# Create the network\n",
    "rng = np.random.RandomState(seed=SEED) # Random number generator\n",
    "net = CNN_Torch_Autograd(rng=rng)\n",
    "# Start training\n",
    "gradient_descent(net,\n",
    "                 cross_entropy_torch, # Same as MLP.\n",
    "                 grad_descent_update_autograd, # Same as MLP.\n",
    "                 rng,\n",
    "                 train_imgs_cnn,\n",
    "                 train_lbls_onehot,\n",
    "                 test_imgs_cnn,\n",
    "                 test_lbls_onehot,\n",
    "                 batch_size=40, \n",
    "                 learning_rate=1e-2,\n",
    "                 total_iters=400,\n",
    "                 iters_per_test=10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If completed successfully, this basic CNN should be trainable in less than a minute on a CPU. Its testing accuracy should finish at 87.61\\% for the stochastic version of gradient descent, similar to that of the MLP.\n",
    "\n",
    "\n",
    "**Q: How do we compute the number of input neurons to the last hidden layer?**  \n",
    "D_in_fc has been given equal to 1440. Why? How is this number computed? It is architecture dependent. To answer, consider how the shape of a tensor changes as it propagates through the CNN. How do the convs and pool operators change the shape of a tensor, with respect to their kernel size, stride, use of padding etc? This is very important to understand, to be able to implement larger and more complex networks.\n"
   ]
  },
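  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Hint (a standard formula, not specific to this notebook):* for a conv or pool layer with kernel size $K$, stride $S$ and padding $P$, an input of spatial size $H$ produces an output of spatial size $\\lfloor (H - K + 2P)/S \\rfloor + 1$ (with `ceil_mode=True`, the floor becomes a ceiling). Apply this to each spatial dimension in turn, then multiply the resulting height, width and number of channels to get the flattened length."
   ]
  },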
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bonus (optional):\n",
    "    \n",
    "Experiment yourselves with adding more conv/pool modules, to build a deeper net. Deeper networks (ala LeNet) can exceed 99\\% test accuracy. Can you approach this by adding more layers?  \n",
    "You will find that deeper nets not only quickly become expensive to train, but are also not easy to train. They require careful hyperparameter configuration (learning rate, batch size, initialization, total training iterations, etc). How hyper-parameters change behaviour during training is not well understood theoretically, and currently depends on practical experience. Get some if you can.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## This notebook:\n",
    "Copyright 2020, Imperial College London  \n",
    "For issues e-mail: konstantinos.kamnitsas12@imperial.ac.uk"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}