"In this tutorial, we will learn the basics of image I/O, simple image processing, and visualisation in Python.\n",
"If you want to refresh your python basics, please check this [tutorial](http://cs231n.github.io/python-numpy-tutorial/) from the computer vision course at Stanford.\n",
"\n",
"By the end of the tutorial, you should be able to:\n",
"1. Use Python and NumPy, and run Jupyter notebooks\n",
"2. Build a simple binary classifier\n",
"3. Implement a logistic regression classifier using numpy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"### Import stuff and set up some helper functions"
]
},
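{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small helper for later sections, we can define a function that plots a grid of digit images. This is a minimal sketch: the name `plot_digits` and the grid layout are our own choices rather than part of the original materials."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"def plot_digits(images, n_cols=10, image_shape=(28, 28)):\n",
"    # plot a grid of digit images given as flat vectors or 2-D arrays\n",
"    images = np.asarray(images)\n",
"    n_rows = int(np.ceil(len(images) / n_cols))\n",
"    fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols, n_rows))\n",
"    axes = np.atleast_1d(axes).ravel()\n",
"    for ax, img in zip(axes, images):\n",
"        ax.imshow(img.reshape(image_shape), cmap='gray')\n",
"    for ax in axes:\n",
"        ax.axis('off')\n",
"    return fig"
]
},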
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import common libraries\n",
"import numpy as np\n",
"\n",
"# adjust settings to plot nice figures inline\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In a real ML task, data would be available in a database and organised in tables, documents or files. In this tutorial, we will be using the [MNIST dataset](http://yann.lecun.com/exdb/mnist/): small images of digits handwritten by high school students and employees of the US Census Bureau. It consists of a training set of 60,000 examples and a test set of 10,000 examples. Each image is size-normalized and centered in a fixed-size 28x28-pixel image, and labelled with the digit it represents. It is often described as the *hello world* of machine learning for imaging. You can find more benchmark datasets [here](https://pytorch.org/docs/stable/torchvision/datasets.html).\n",
"\n",
"Here, we will sort our data and fix the random seed to ensure you get the same results every time you run the experiments. Then we will plot some sample digits after sorting the data."
]
},
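{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to obtain the dataset is sketched below. This is an assumption on our part, since the original materials may distribute MNIST differently: here we use scikit-learn's `fetch_openml`, which downloads all 70,000 images as one array, and the conventional split takes the first 60,000 examples for training and the last 10,000 for testing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def normalize(images):\n",
"    # scale uint8 pixel values in [0, 255] to floats in [0, 1]\n",
"    return np.asarray(images, dtype=np.float32) / 255.0\n",
"\n",
"def load_mnist():\n",
"    # assumption: fetch MNIST from OpenML (downloads on first call)\n",
"    from sklearn.datasets import fetch_openml\n",
"    X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)\n",
"    X, y = normalize(X), y.astype(np.int64)\n",
"    return (X[:60000], y[:60000]), (X[60000:], y[60000:])"
]
},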
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# we will sort our data and fix the random generator seed to get similar results from different runs\n",
"np.random.seed(42)\n",
"\n",
"# sort dataset\n",
"def sort_data(data, labels):\n",
"    sorted_idxs = np.array(sorted([(target, i) for i, target in enumerate(labels)]))[:, 1]\n",
"    return data[sorted_idxs], labels[sorted_idxs]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that our data are cleaned and sorted, we will train a simple binary classifier to distinguish between two selected digits.\n",
"\n",
"Data is usually divided into three sets: training, validation, and testing. The training set is used to fit the model's parameters, the validation set is used to tune the model's hyperparameters, and the performance of the trained model is finally evaluated on the testing set. For this tutorial, we will split the data into train and test sets for simplicity.\n",
"\n",
"**Task**\n",
"\n",
"1. Extract ones and eights from both training and testing data\n",
"2. Shuffle training data\n",
"3. Plot number of images versus number of 'white' pixels per image\n",
"4. Can you predict the label based only on the number of 'white' pixels? What are the training and testing errors for such an approach?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Repeat the previous example to classify the digits 0 and 8 instead of 1 and 8. Will the threshold binary classifier differentiate between the two categories based on the number of 'white' pixels?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
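]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The threshold approach from the tasks above could be sketched as follows. This is a hedged outline rather than the reference solution: the helper names and the default pixel threshold of 0.5 are our own choices. A suitable `count_threshold` can be read off the histogram of white-pixel counts on the training set, and the error is then `np.mean(predictions != labels)`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def count_white(images, pixel_threshold=0.5):\n",
"    # number of 'white' (bright) pixels in each flattened image\n",
"    return (np.asarray(images) > pixel_threshold).sum(axis=1)\n",
"\n",
"def threshold_classifier(images, count_threshold, below_label=1, above_label=8):\n",
"    # predict the 'thin' digit when few pixels are white, the 'thick' one otherwise\n",
"    counts = count_white(images)\n",
"    return np.where(counts < count_threshold, below_label, above_label)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [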
"In the previous example, we used a simple threshold to classify each image of a digit using one feature (number of 'white' pixels).\n",
"\n",
"Here, we will use a logistic regression model for the same task but using raw pixel information as input features. The logistic regression function is defined as: $h_{\\Theta}(\\mathbf{x}) = \\frac{1}{1 + \\exp(- \\Theta^{\\top} \\mathbf{x})}$.\n",
"\n",
"It's useful to group all training samples into one big matrix $\\mathbf{X}$ of size *(number_samples x number_features)*, and their labels into one vector $\\mathbf{y}$ as in the code below.\n",
"\n",
"Training our model is a loop over three main steps:\n",
"1. Evaluate the cost function $J(\\Theta)$\n",
"2. Compute the partial derivatives\n",
"3. Update the model parameters\n",
"\n",
"---\n",
"\n",
"**Task**\n",
"\n",
"1. Complete the logistic regression class below \n",
"2. Train a logistic regression model on the data from the previous example\n",
"3. Compute train and test accuracies, and compare with the previous results\n",
"4. Plot the trained parameters and comment on the figure"
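]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting the three steps together, a logistic regression class might look like the sketch below. This is one possible shape, assuming batch gradient descent with our own choices of learning rate and iteration count, and is meant as a starting point rather than the reference solution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"class LogisticRegression:\n",
"    # minimal binary logistic regression trained with batch gradient descent\n",
"    def __init__(self, n_features, lr=0.1):\n",
"        self.theta = np.zeros(n_features)  # model parameters Theta\n",
"        self.lr = lr\n",
"\n",
"    def predict_proba(self, X):\n",
"        # h_Theta(x) = 1 / (1 + exp(-Theta^T x)), applied row-wise\n",
"        return 1.0 / (1.0 + np.exp(-X @ self.theta))\n",
"\n",
"    def cost(self, X, y):\n",
"        # cross-entropy cost J(Theta); eps avoids log(0)\n",
"        eps = 1e-12\n",
"        h = self.predict_proba(X)\n",
"        return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))\n",
"\n",
"    def fit(self, X, y, n_iters=1000):\n",
"        for _ in range(n_iters):\n",
"            h = self.predict_proba(X)      # step 1: evaluate the hypothesis\n",
"            grad = X.T @ (h - y) / len(y)  # step 2: partial derivatives of J\n",
"            self.theta -= self.lr * grad   # step 3: update the parameters\n",
"        return self\n",
"\n",
"    def predict(self, X):\n",
"        return (self.predict_proba(X) >= 0.5).astype(int)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [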
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using explicit features for classification\n",
"\n",
"We have now seen how to build a digit classifier using raw pixel information as features. In some ML applications, it is possible (or even desirable) to hand-engineer the feature extraction stage. Here, we explore how far we can get with morphometric features extracted from the MNIST digits, namely area, length, thickness, slant, width, and height."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time we use a logistic regression model from [scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).\n",
"\n",
"Train the logistic regression model on the data and calculate the classification accuracy on both the training and testing sets."