Commit ea4da372 authored by Ben Glocker

tutorial files

parent 1ebf50b8
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 1: Introduction to Machine Learning for Imaging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running on Colab"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget https://www.doc.ic.ac.uk/~bglocker/teaching/notebooks/supervised-data.zip\n",
"! unzip supervised-data.zip\n",
"\n",
"# data directory\n",
"data_dir = 'data/mnist/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Running on DoC lab machines"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# data directory\n",
"data_dir = '/vol/lab/course/416/data/mnist/'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Image classification with Python"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this tutorial, we will learn basics of image IO and simple processing, and visualisation in Python. \n",
"If you want to refresh your python basics, please check this [tutorial](http://cs231n.github.io/python-numpy-tutorial/) from the computer vision course at Stanford.\n",
"\n",
"By the end of the tutorial, you should be able to:\n",
"1. Use python, numpy, and run jupyter notebook\n",
"2. Build a simple binary classifier \n",
"3. Implement a logistic regression classifier using numpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Import stuff and set up some helper functions"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import common libraries\n",
"import numpy as np\n",
"\n",
"# adjust settings to plot nice figures inline\n",
"%matplotlib inline\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"plt.rcParams['axes.labelsize'] = 14\n",
"plt.rcParams['xtick.labelsize'] = 12\n",
"plt.rcParams['ytick.labelsize'] = 12"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"\n",
"#########################################################\n",
"# functions to plot digits\n",
"#########################################################\n",
"\n",
"def plot_digit(data):\n",
" image = data.reshape(28, 28)\n",
" plt.imshow(image, cmap = matplotlib.cm.gray,\n",
" interpolation=\"nearest\")\n",
" plt.colorbar()\n",
" # plt.axis(\"off\")\n",
"\n",
"\n",
"def plot_digits(data, n_samples_row=10):\n",
" images = [image.reshape(28,28) for image in data]\n",
" n_rows = (len(images) - 1) // n_samples_row + 1\n",
" # append empty images if the last row is not complete\n",
" empty_images = n_rows * n_samples_row - len(data)\n",
" images.append(np.zeros((28, 28 * empty_images)))\n",
" # draw row by row\n",
" images_row = []\n",
" for current_row in range(n_rows):\n",
" tmp_row_images = images[current_row * n_samples_row : (current_row + 1) * n_samples_row]\n",
" images_row.append(np.concatenate(tmp_row_images, axis=1))\n",
" # draw all in one image\n",
" image = np.concatenate(images_row, axis=0)\n",
" plt.figure(figsize=(n_samples_row,n_rows))\n",
" plt.imshow(image, cmap = matplotlib.cm.gray)\n",
" plt.colorbar()\n",
" # plt.axis(\"off\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## MNIST digit recognition\n",
"\n",
"In a real ML task, data would be available in a database and organised in tables, documents or files. In this tutorial, we will be using the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), small images of digits handwritten by high school students and employees of the US Census Bureau. It consists of a training set of 60,000 examples, and a test set of 10,000 examples. Each image is size-normalized and centered in a fixed-size image 28x28 pixels, and labeled with the digit it represents. It is kind of the *hello world* of machine learning for imaging. You can find more benchmark datasets [here](https://pytorch.org/docs/stable/torchvision/datasets.html)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torchvision.datasets as dset\n",
"\n",
"# train data\n",
"train_set = dset.MNIST(root=data_dir, train=True, download=True)\n",
"train_data = np.array(train_set.train_data)\n",
"train_labels = np.array(train_set.train_labels)\n",
"\n",
"# test data\n",
"test_set = dset.MNIST(root=data_dir, train=False, download=True)\n",
"test_data = np.array(test_set.test_data)\n",
"test_labels = np.array(test_set.test_labels)\\\n",
"\n",
"# print train and test data details\n",
"print('Train data:')\n",
"print('shape (images, x,y) = {}'.format(train_data.shape))\n",
"print('labels = {}'.format(np.unique(train_labels)))\n",
"\n",
"print('Test data:')\n",
"print('shape (images, x,y) = {}'.format(test_data.shape))\n",
"print('labels = {}'.format(np.unique(test_labels)))\n",
"\n",
"\n",
"# plot sample digits\n",
"plot_digits(train_set.train_data[:100])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"Here, we will sort our data and fix the random seed to ensure geting same results everytime you run the experiments. Then plot some sampled digits after sorting the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we will sort our data and fix the random generator seed to get similar results from different runs\n",
"np.random.seed(42)\n",
"\n",
"# sort dataset\n",
"def sort_data(data, labels):\n",
" sorted_idxs = np.array(sorted([(target, i) for i, target in enumerate(labels)]))[:, 1]\n",
" return data[sorted_idxs], labels[sorted_idxs]\n",
"\n",
"############################################################################\n",
"# Q: use the previous function to sort both training and testing data\n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE\n",
"\n",
"############################################################################\n",
"\n",
"# plot sampled images from sorted data\n",
"# here it samples 20 samples of [0,1], 30 samples of [2,3,4], and 50 samples of [5,6,7,8,9] - 10 samples for each digit\n",
"example_images = np.r_[train_data[:12000:600], train_data[13000:30600:600], train_data[30600:60000:590]]\n",
"plot_digits(example_images)\n",
"plt.show()"
]
},
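{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of one possible answer to the exercise above, assuming the in-place variable names used by the plotting code:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch: sort both splits in place with sort_data\n",
"train_data, train_labels = sort_data(train_data, train_labels)\n",
"test_data, test_labels = sort_data(test_data, test_labels)"
]
},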
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Simple Binary Classifier\n",
"\n",
"Now our data are cleaned and sorted, we will train a simple binary classifier to distinguish between two selected digits. \n",
"\n",
"Data usually is divided into three sets for training, validation, and testing. The training data is used to train the model's parameters, while the validation set is used to adjust the model's hyperparameters. Finally, the performance of the trained model is evaluated on the testing data. For this tutorial we will split the data into train and test for simplification. \n",
"\n",
"**Task**\n",
"\n",
"1. Extract ones and eights from both training and testing data\n",
"2. Shuffle training data\n",
"3. Plot number of images versus number of 'white' pixels per image\n",
"4. Can you predict the label based only on the number of 'white' pixels? What's the training and testing error for such an approach?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"############################################################################\n",
"# Extract sample digits of ones and eights\n",
"############################################################################\n",
"\n",
"def sample_data_digits(data, labels, labels_to_select):\n",
" # convert input 3d arrays to 2d arrays\n",
" nsamples, nx, ny = data.shape\n",
" data_vec = np.reshape(data,(nsamples,nx*ny))\n",
" \n",
" selected_indexes = np.isin(labels, labels_to_select)\n",
" selected_data = data_vec[selected_indexes]\n",
" selected_labels = labels[selected_indexes]\n",
" \n",
" # Convert images from gray to binary by thresholding intensity values\n",
" selected_data = 1.0 * (selected_data >= 128)\n",
"\n",
" # convert labels to binary: digit_0=False, digit_1=True\n",
" selected_labels = selected_labels==labels_to_select[1]\n",
" \n",
" # shuffle data\n",
" shuffle_index = np.random.permutation(len(selected_labels))\n",
" selected_data, selected_labels = selected_data[shuffle_index], selected_labels[shuffle_index]\n",
"\n",
" return selected_data, selected_labels\n",
"\n",
"\n",
"############################################################################\n",
"# Q: extract ones and eights digits from both training and testing data \n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE\n",
"\n",
"############################################################################\n",
"\n",
"# plot sampled digits\n",
"plot_digits(selected_train_data[0:50])\n",
"plt.show()\n",
"plot_digits(selected_train_data[8000:8050])\n",
"plt.show()"
]
},
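{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of one way to complete the extraction step above (the variable names follow the plotting code, and the same pattern is used later for the morphometric features):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch: extract 1s and 8s with sample_data_digits\n",
"labels_to_select = [1, 8]\n",
"selected_train_data, selected_train_labels = sample_data_digits(train_data, train_labels, labels_to_select)\n",
"selected_test_data, selected_test_labels = sample_data_digits(test_data, test_labels, labels_to_select)"
]
},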
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"############################################################################\n",
"# Q:plot number of images versus number of 'white' foreground pixels \n",
"# for both 1s and 8s classes.\n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE"
]
},
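{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one way to draw this plot, assuming the `selected_*` variables from the extraction step. Since the images are binarised, summing the pixels of an image gives its number of 'white' pixels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch: per-class histograms of white-pixel counts\n",
"white_counts = selected_train_data.sum(axis=1)\n",
"plt.hist(white_counts[~selected_train_labels], bins=50, alpha=0.5, label='1s')\n",
"plt.hist(white_counts[selected_train_labels], bins=50, alpha=0.5, label='8s')\n",
"plt.xlabel('number of white pixels')\n",
"plt.ylabel('number of images')\n",
"plt.legend()\n",
"plt.show()"
]
},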
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"############################################################################\n",
"# Q: select threshold value to sperate between the two classes\n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"############################################################################\n",
"# Q: classify digits using a threshold \n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"############################################################################\n",
"# Q: calculate both training and testing accuracy\n",
"# You should get accuracies around 89-90%\n",
"############################################################################\n",
"\n",
"train_acc = #INSERT CODE HERE\n",
"print('Train accuracy = {:.2f}%'.format(train_acc))\n",
"\n",
"test_acc = #INSERT CODE HERE\n",
"print('Test accuracy = {:.2f}%'.format(test_acc))"
]
},
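{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of the threshold classifier and its accuracy, for reference. The threshold value here is an assumption picked by eye; choose your own from your histogram:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch: classify by white-pixel count\n",
"threshold = 100  # assumption: a value between the two histogram modes\n",
"train_preds = selected_train_data.sum(axis=1) > threshold\n",
"test_preds = selected_test_data.sum(axis=1) > threshold\n",
"print('Train accuracy = {:.2f}%'.format(100.0 * np.mean(train_preds == selected_train_labels)))\n",
"print('Test accuracy = {:.2f}%'.format(100.0 * np.mean(test_preds == selected_test_labels)))"
]
},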
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"**Task**\n",
"\n",
"Repeat the previous examples to classify digits 0s and 8s instead of 1s and 8s. Will the threshold binary classifier differentiate between the two categories based on number of 'white' pixels?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"############################################################################\n",
"# Q: extract zeros and eights digits from both training and testing data\n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE\n",
"\n",
"############################################################\n",
"# Q: plot number of images versus number of pixels\n",
"############################################################\n",
"\n",
"#INSERT CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"############################################################################\n",
"# Q: select threshold value to sperate between the two classes\n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE\n",
"\n",
"############################################################################\n",
"# Q: classify digits using a threshold \n",
"############################################################################\n",
"\n",
"#INSERT CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_acc = #INSERT CODE HERE\n",
"print('Train accuracy = {:.2f}%'.format(train_acc))\n",
"\n",
"test_acc = #INSERT CODE HERE\n",
"print('Test accuracy = {:.2f}%'.format(test_acc))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Logistic Regression using Numpy\n",
"\n",
"In the previous example, we used a simple threshold to classify each image of a digit using one feature (number of 'white' pixels).\n",
"\n",
"Here, we will use a logistic regression model for the same task but using raw pixel information as input features. The logistic regression function is defined as: $h_{\\Theta}(\\mathbf{x}) = \\frac{1}{1 + \\exp(- \\Theta^{\\top} \\mathbf{x})}$.\n",
"\n",
"It's useful to group all training samples into one big matrix $\\mathbf{X}$ of size *(number_samples x number_features)*, and their labels into one vector $\\mathbf{y}$ as in the code below.\n",
"\n",
"Training our model is a loop that includes three main steps\n",
"1. Evaluate the cost function $J(\\Theta)$\n",
"2. Compute partial derivatives\n",
"3. Update the model paramteters\n",
"\n",
"---\n",
"\n",
"**Task**\n",
"\n",
"1. Complete the logistic regression class below \n",
"2. Train a logistic regression model on the data from the previous example\n",
"3. Compute train and test accuracies, and compare with the previous results\n",
"4. Plot the trained parameters and comment on the figure"
]
},
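{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference, here is a standalone sketch of the three ingredients in isolation: the sigmoid $h_{\\Theta}$, the cross-entropy cost $J(\\Theta)$, and its gradient. These are the standard logistic regression formulas; the function names are illustrative, and the class below wires the same pieces into the training loop."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# standalone sketch of the pieces used inside the class below\n",
"def sigmoid(z):\n",
"    # h(x) = 1 / (1 + exp(-z)) with z = theta^T x\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"def cross_entropy(h, y):\n",
"    # J(theta) = -mean(y*log(h) + (1-y)*log(1-h)); eps avoids log(0)\n",
"    eps = 1e-12\n",
"    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))\n",
"\n",
"def loss_gradient(X, h, y):\n",
"    # dJ/dtheta = X^T (h - y) / m for the cross-entropy cost\n",
"    return X.T.dot(h - y) / y.size"
]
},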
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class LogisticRegression:\n",
" def __init__(self, lr=0.05, num_iter=1000, add_bias=True, verbose=True):\n",
" self.lr = lr\n",
" self.verbose = verbose\n",
" self.num_iter = num_iter\n",
" self.add_bias = add_bias\n",
" \n",
" def __add_bias(self, X):\n",
" bias = np.ones((X.shape[0], 1))\n",
" return np.concatenate((bias, X), axis=1)\n",
" \n",
"\n",
"\n",
" def __loss(self, h, y):\n",
" ''' computes loss values '''\n",
" y = np.array(y,dtype=float) \n",
" ############################################################################\n",
" # Q: compute the loss \n",
" ############################################################################\n",
" return #INSERT CODE HERE\n",
"\n",
" \n",
" def fit(self, X, y):\n",
" ''' \n",
" Optimise our model using gradient descent\n",
" Arguments:\n",
" X input features\n",
" y labels from training data\n",
" \n",
" '''\n",
" if self.add_bias:\n",
" X = self.__add_bias(X)\n",
" \n",
" ############################################################################\n",
" # Q: initialise weights randomly with normal distribution N(0,0.01)\n",
" ############################################################################\n",
" self.theta = #INSERT CODE HERE\n",
" \n",
" for i in range(self.num_iter):\n",
" ############################################################################\n",
" # Q: forward propagation\n",
" ############################################################################\n",
" z = #INSERT CODE HERE\n",
" h = #INSERT CODE HERE\n",
" ############################################################################\n",
" # Q: backward propagation\n",
" ############################################################################\n",
" gradient = #INSERT CODE HERE\n",
" # update parameters\n",
" self.theta -= #INSERT CODE HERE\n",
" ############################################################################\n",
" # Q: print loss\n",
" ############################################################################\n",
" if(self.verbose == True and i % 50 == 0):\n",
" h = #INSERT CODE HERE\n",
" print('loss: {} \\t'.format(self.__loss(h, y)))\n",
" \n",
" def predict_probs(self,X):\n",
" ''' returns output probabilities\n",
" '''\n",
" ############################################################################\n",
" # Q: forward propagation\n",
" ############################################################################\n",
" if self.add_bias:\n",
" X = self.__add_bias(X)\n",
" z = #INSERT CODE HERE\n",
" return #INSERT CODE HERE\n",
"\n",
" def predict(self, X, threshold=0.5):\n",
" ''' returns output classes\n",
" '''\n",
" return self.predict_probs(X) >= threshold\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#########################################################################\n",
"# Q: train our model\n",
"#########################################################################\n",
"\n",
"#INSERT CODE HERE"
]
},
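{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal training sketch, assuming the class above has been completed and the 1s-vs-8s `selected_*` variables are still in scope. The weight-plotting cell below expects the trained model in a variable called `model`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a possible solution sketch: train on the binarised 1s-vs-8s data\n",
"model = LogisticRegression(lr=0.05, num_iter=1000)\n",
"model.fit(selected_train_data, selected_train_labels)"
]
},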
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#########################################################################\n",
"# Q: Evaluate the trained model - compute train and test accuracies\n",
"# You should get accuracies around 98-99%\n",
"#########################################################################\n",
"train_preds = #INSERT CODE HERE\n",
"logistic_train_acc = #INSERT CODE HERE\n",
"print('Train accuracy = {:.2f}%'.format(logistic_train_acc))\n",
"\n",
"test_preds = #INSERT CODE HERE\n",
"logistic_test_acc = #INSERT CODE HERE\n",
"print('Test accuracy = {:.2f}%'.format(logistic_test_acc))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#########################################################################\n",
"# Plot trained model params (weights) as an image of size (28x28)\n",
"#########################################################################\n",
"plt.imshow(model.theta[:-1].reshape(28,28))\n",
"plt.colorbar()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"## Using explicit features for classification\n",
"\n",
"We have now seen how we can build a digit classifier using the raw pixel information as features. In some ML applications, it is possible (or even desired) to hand engineer the feature extraction stage. Here, we are exploring how far we can get with morphometric features extracted for MNIST digits, namely the area, length, thickness, slant, width, height."
]
},
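{
"cell_type": "markdown",
"metadata": {},
"source": [
"The CSV files loaded below contain precomputed values for these features. Purely as an illustration of what such a hand-engineered feature might look like, here is a crude pixel-based proxy for 'area' (an assumption for illustration only, not necessarily how the CSV values were computed):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# illustrative sketch only: a crude proxy for an 'area' feature\n",
"def crude_area(image, threshold=128):\n",
"    # fraction of foreground pixels in the 28x28 image\n",
"    return np.mean(image >= threshold)"
]
},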
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"\n",
"# Reload MNIST\n",
"train_data = np.array(train_set.train_data)\n",
"train_labels = np.array(train_set.train_labels)\n",
"\n",
"# test data\n",
"test_data = np.array(test_set.test_data)\n",
"test_labels = np.array(test_set.test_labels)\n",
"\n",
"# Read the meta data using pandas\n",
"train_features = pd.read_csv(data_dir + 'train-morpho.csv')\n",
"test_features = pd.read_csv(data_dir + 't10k-morpho.csv')\n",
"train_features.head() # show the first five data entries of the training set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(test_features['area'],test_features['length'], marker='.', c=test_labels)\n",
"plt.grid()\n",
"plt.xlabel('feature i')\n",
"plt.ylabel('feature j')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Reformat the data\n",
"X_train = np.transpose(np.array([train_features['area'].values,train_features['length'].values,train_features['thickness'].values,train_features['slant'].values,train_features['width'].values,train_features['height'].values]))\n",
"X_test = np.transpose(np.array([test_features['area'].values,test_features['length'].values,test_features['thickness'].values,test_features['slant'].values,test_features['width'].values,test_features['height'].values]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def sample_data(data, labels, labels_to_select):\n",
" selected_indexes = np.isin(labels, labels_to_select)\n",
" selected_data = data[selected_indexes]\n",
" selected_labels = labels[selected_indexes]\n",
"\n",
" # convert labels to binary: digit_0=False, digit_1=True\n",
" selected_labels = selected_labels==labels_to_select[1]\n",
" \n",
" # shuffle data\n",
" shuffle_index = np.random.permutation(len(selected_labels))\n",
" selected_data, selected_labels = selected_data[shuffle_index], selected_labels[shuffle_index]\n",
"\n",
" return selected_data, selected_labels"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar to above, let's pick data for a simple binary classification between two digits. Let's start with 0s and 8s."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"labels_to_select = [0,8]\n",
"selected_train_data, selected_train_labels = sample_data(X_train,train_labels,labels_to_select)\n",
"selected_test_data, selected_test_labels = sample_data(X_test,test_labels,labels_to_select)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"**Task**\n",
"\n",
"This time we use a logistic regression model from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n",
"\n",
"Train the logistic regression on the data and calculate the classification accuracy for both training and testing."
]
},
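{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a minimal sketch of what the exercise below asks for, assuming the `selected_*` morphometric-feature variables from the cells above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"\n",
"# a possible solution sketch: fit and score scikit-learn's model\n",
"clf = linear_model.LogisticRegression()\n",
"clf.fit(selected_train_data, selected_train_labels)\n",
"print('Train accuracy = {:.2f}%'.format(100.0 * clf.score(selected_train_data, selected_train_labels)))\n",
"print('Test accuracy = {:.2f}%'.format(100.0 * clf.score(selected_test_data, selected_test_labels)))"
]
},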
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import linear_model\n",
"\n",
"#########################################################################\n",
"# Q: Use scikit-learn's logistic regression model\n",
"#########################################################################\n",
"\n",
"#INSERT CODE HERE"
]