{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.1"
    },
    "colab": {
      "name": "task_2_cnn_custom_embedding.ipynb",
      "provenance": [],
      "collapsed_sections": [
        "f1mLbiQdoJEM",
        "MlNWhyGYWMp_",
        "qGqeou9f0ZzD",
        "5GXHc2OjWwRQ",
        "8WymN6OTt7Gw",
        "iyOMIDgAM1dP",
        "jYis4uYGW0e5",
        "R12IslmZj5rN",
        "gvF44kqjj5y-",
        "_haGTYuJXVie",
        "Kj_nWr90krq9"
      ],
      "machine_shape": "hm"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "09bWQWD1nrq3"
      },
      "source": [
        "## Approach 2: No pre-trained representations"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2Bgiq_D7pCIV"
      },
      "source": [
        "### Coursework coding instructions (please also see full coursework spec)\n",
        "\n",
        "Please choose if you want to do either Task 1 or Task 2. You should write your report about one task only.\n",
        "\n",
        "For the task you choose you will need to do two approaches:\n",
        "  - Approach 1, which can use use pre-trained embeddings / models\n",
        "  - Approach 2, which should not use any pre-trained embeddings or models\n",
        "We should be able to run both approaches from the same colab file\n",
        "\n",
        "#### Running your code:\n",
        "  - Your models should run automatically when running your colab file without further intervention\n",
        "  - For each task you should automatically output the performance of both models\n",
        "  - Your code should automatically download any libraries required\n",
        "\n",
        "#### Structure of your code:\n",
        "  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required\n",
        "  - Otherwise there are no restrictions on what you can do in your code\n",
        "\n",
        "#### Documentation:\n",
        "  - You are expected to produce a .README file summarising how you have approached both tasks\n",
        "\n",
        "#### Reproducibility:\n",
        "  - Your .README file should explain how to replicate the different experiments mentioned in your report\n",
        "\n",
        "Good luck! We are really looking forward to seeing your reports and your model code!"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "f1mLbiQdoJEM"
      },
      "source": [
        "### Setup"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4zxDPZ2HUr9B"
      },
      "source": [
        "# Environment variables\n",
        "data_dir = \"./data\"\n",
        "model_dir = \"./models\"\n",
        "\n",
        "# Random Seed\n",
        "SEED = 1\n",
        "\n",
        "# Model Architexture\n",
        "base_model = [\n",
        "    # expand_ratio, channels, repeats, stride, kernel_size\n",
        "    [1, 256, 1, 1, 3],\n",
        "    [3, 300, 2, 2, 3],\n",
        "    [3, 512, 2, 2, 3],\n",
        "]\n",
        "\n",
        "# Model Hyperparameters\n",
        "BATCH_SIZE = 32\n",
        "DROPOUT_RATE = 0.3\n",
        "EMBEDDING_DIM = 100 # word vect embedding dimension\n",
        "ENCODING_DIM = 1024 # hidden dimension of final pooling layer\n",
        "\n",
        "# Number of Epochs\n",
        "epochs = 30\n",
        "\n",
        "# Proportion of training data for train compared to dev\n",
        "train_proportion = 0.8"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0iA-C2dcpCJN"
      },
      "source": [
        "# Imports\n",
        "import torch\n",
        "import torch.nn as nn\n",
        "import torch.nn.functional as F\n",
        "import pandas as pd\n",
        "import numpy as np\n",
        "from sklearn.feature_extraction.text import CountVectorizer\n",
        "from torch.utils.data import Dataset, random_split\n",
        "from sklearn.feature_extraction.text import TfidfTransformer\n",
        "from sklearn.model_selection import train_test_split\n",
        "from sklearn.naive_bayes import MultinomialNB\n",
        "import torch.optim as optim\n",
        "import codecs\n",
        "import tqdm\n",
        "import nltk\n",
        "import re\n",
        "from gensim.models import Word2Vec\n",
        "import multiprocessing"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uWs9IykGpCJQ"
      },
      "source": [
        "# Set torch seed and device\n",
        "torch.manual_seed(SEED)\n",
        "torch.cuda.manual_seed(SEED)\n",
        "torch.backends.cudnn.deterministic = True\n",
        "\n",
        "use_cuda = torch.cuda.is_available()\n",
        "device = torch.device(\"cuda:0\" if use_cuda else \"cpu\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ctZiGPdWpCJI",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "7ecbcefb-a7e3-48cb-810b-64b067d8f336"
      },
      "source": [
        "# Download the task dataset\n",
        "# if not os.path.exists('/data'):\n",
        "!wget https://cs.rochester.edu/u/nhossain/semeval-2020-task-7-dataset.zip\n",
        "!unzip /content/semeval-2020-task-7-dataset.zip\n",
        "!rm semeval-2020-task-7-dataset.zip"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "--2021-03-03 11:10:15--  https://cs.rochester.edu/u/nhossain/semeval-2020-task-7-dataset.zip\n",
            "Resolving cs.rochester.edu (cs.rochester.edu)... 192.5.53.208\n",
            "Connecting to cs.rochester.edu (cs.rochester.edu)|192.5.53.208|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 1621456 (1.5M) [application/zip]\n",
            "Saving to: ‘semeval-2020-task-7-dataset.zip’\n",
            "\n",
            "semeval-2020-task-7 100%[===================>]   1.55M  2.57MB/s    in 0.6s    \n",
            "\n",
            "2021-03-03 11:10:17 (2.57 MB/s) - ‘semeval-2020-task-7-dataset.zip’ saved [1621456/1621456]\n",
            "\n",
            "Archive:  /content/semeval-2020-task-7-dataset.zip\n",
            "   creating: semeval-2020-task-7-dataset/\n",
            "  inflating: semeval-2020-task-7-dataset/.DS_Store  \n",
            "   creating: semeval-2020-task-7-dataset/subtask-1/\n",
            "  inflating: semeval-2020-task-7-dataset/subtask-1/train_funlines.csv  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-1/.DS_Store  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-1/test.csv  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-1/dev.csv  \n",
            " extracting: semeval-2020-task-7-dataset/subtask-1/baseline.zip  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-1/train.csv  \n",
            "  inflating: semeval-2020-task-7-dataset/README.txt  \n",
            "   creating: semeval-2020-task-7-dataset/subtask-2/\n",
            "  inflating: semeval-2020-task-7-dataset/subtask-2/train_funlines.csv  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-2/.DS_Store  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-2/test.csv  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-2/dev.csv  \n",
            " extracting: semeval-2020-task-7-dataset/subtask-2/baseline.zip  \n",
            "  inflating: semeval-2020-task-7-dataset/subtask-2/train.csv  \n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ITI-7C13pCJS"
      },
      "source": [
        "# Load data \n",
        "train_df = pd.read_csv('semeval-2020-task-7-dataset/subtask-2/train.csv')\n",
        "dev_df = pd.read_csv('semeval-2020-task-7-dataset/subtask-2/dev.csv')\n",
        "test_df = pd.read_csv('semeval-2020-task-7-dataset/subtask-2/test.csv')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MlNWhyGYWMp_"
      },
      "source": [
        "### Pre-processing"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qGqeou9f0ZzD"
      },
      "source": [
        "#### Pre-process raw data (dataframes)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "S3UaF0Ji0dpd"
      },
      "source": [
        "def extract_data(df):\n",
        "    \"\"\"Get edited_text_1, edited_text_2, original_text, and label from raw dataframe\"\"\"\n",
        "\n",
        "    # Edited texts 1, x1\n",
        "    original_1 = df['original1']\n",
        "    edit_word_1 = df['edit1']\n",
        "    edit_1 = pd.Series([re.sub('<.*\\/>', e, s) for s, e in zip(original_1, edit_word_1)])\n",
        "\n",
        "    # Edited texts 2, x2\n",
        "    original_2 = df['original2']\n",
        "    edit_word_2 = df['edit2']\n",
        "    edit_2 = pd.Series([re.sub('<.*\\/>', e, s) for s, e in zip(original_2, edit_word_2)])\n",
        "\n",
        "    # Original texts, x3\n",
        "    # can be generated with either original_1 or original_2\n",
        "    original = pd.Series([re.sub('<|\\/>', '', s) for s in original_1]) \n",
        "\n",
        "    # Label, y in {0, 1, 2}\n",
        "    labels = df['label']\n",
        "\n",
        "    return edit_1, edit_2, original, labels"
      ],
      "execution_count": null,
      "outputs": []
    },
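    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "The next cell is a small illustrative sanity check (not part of the original pipeline, and the headline is made up): it applies the same substitutions used in `extract_data` to show how the `<word/>` placeholder is replaced by the edit word, and how the original headline is recovered."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Illustrative example of the regexes used in extract_data (hypothetical headline)\n",
        "example_original = \"Scientists discover <water/> on Mars\"\n",
        "example_edit_word = \"cheese\"\n",
        "\n",
        "print(re.sub('<.*\\/>', example_edit_word, example_original))  # expected: Scientists discover cheese on Mars\n",
        "print(re.sub('<|\\/>', '', example_original))                  # expected: Scientists discover water on Mars"
      ],
      "execution_count": null,
      "outputs": []
    },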
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5GXHc2OjWwRQ"
      },
      "source": [
        "#### Create vocabulary"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-opM9TDlpCJd"
      },
      "source": [
        "# To create our vocab\n",
        "def create_vocab(data, vocabulary=True):\n",
        "    \"\"\"\n",
        "    Creating a corpus of all the tokens used\n",
        "    \"\"\"\n",
        "    tokenized_corpus = [] # Let us put the tokenized corpus in a list\n",
        "\n",
        "    for sentence in data:\n",
        "\n",
        "        tokenized_sentence = []\n",
        "\n",
        "        for token in sentence.split(' '): # simplest split is\n",
        "\n",
        "            tokenized_sentence.append(token)\n",
        "\n",
        "        tokenized_corpus.append(tokenized_sentence)\n",
        "\n",
        "    # Return tokenized corpus only\n",
        "    if not vocabulary:\n",
        "        return [], tokenized_corpus\n",
        "\n",
        "    # Create single list of all vocabulary\n",
        "    vocabulary = []  # Let us put all the tokens (mostly words) appearing in the vocabulary in a list\n",
        "\n",
        "    for sentence in tokenized_corpus:\n",
        "\n",
        "        for token in sentence:\n",
        "\n",
        "            if token not in vocabulary:\n",
        "\n",
        "                if True:\n",
        "                    vocabulary.append(token)\n",
        "\n",
        "    return vocabulary, tokenized_corpus"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "O-bb9Y5fGsUs",
        "outputId": "a7faa4d8-3c7e-4899-c25f-9ed9e89b3576"
      },
      "source": [
        "# Training, dev, and test data\n",
        "train_edit_1, train_edit_2, train_original, _ = extract_data(train_df)\n",
        "dev_edit_1, dev_edit_2, dev_original, _ = extract_data(dev_df)\n",
        "test_edit_1, test_edit_2, test_original, _ = extract_data(test_df)\n",
        "\n",
        "# Vocabs for training set\n",
        "_, train_tokenized_edit1_corpus = create_vocab(train_edit_1, vocabulary=False)\n",
        "_, train_tokenized_edit2_corpus = create_vocab(train_edit_2, vocabulary=False)\n",
        "train_vocab, train_tokenized_corpus = create_vocab(pd.concat([train_edit_1, train_edit_2, train_original]))\n",
        "\n",
        "print(\"Train vocab created.\")\n",
        "\n",
        "# Vocabs for dev set\n",
        "_, dev_tokenized_edit1_corpus = create_vocab(dev_edit_1, vocabulary=False)\n",
        "_, dev_tokenized_edit2_corpus = create_vocab(dev_edit_2, vocabulary=False)\n",
        "dev_vocab, dev_tokenized_corpus = create_vocab(pd.concat([dev_edit_1, dev_edit_2, dev_original]))\n",
        "\n",
        "print(\"Dev vocab created.\")\n",
        "\n",
        "# Vocabs for test set\n",
        "_, test_tokenized_edit1_corpus = create_vocab(test_edit_1, vocabulary=False)\n",
        "_, test_tokenized_edit2_corpus = create_vocab(test_edit_2, vocabulary=False)\n",
        "test_vocab, test_tokenized_corpus = create_vocab(pd.concat([test_edit_1, test_edit_2, test_original]))\n",
        "\n",
        "print(\"Test vocab created.\")\n",
        "\n",
        "# Creating joint vocab from dev and train:\n",
        "_, joint_tokenized_edit1_corpus = create_vocab(pd.concat([train_edit_1, dev_edit_1]))\n",
        "_, joint_tokenized_edit2_corpus = create_vocab(pd.concat([train_edit_2, dev_edit_2]))\n",
        "joint_vocab, joint_tokenized_corpus = create_vocab(\n",
        "    pd.concat([train_edit_1, train_edit_2, train_original, dev_edit_1, dev_edit_2, dev_original])\n",
        ")\n",
        "print(\"Vocab created.\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Train vocab created.\n",
            "Dev vocab created.\n",
            "Test vocab created.\n",
            "Vocab created.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "fbr2aa5QMd2H",
        "outputId": "a2839c37-ba7a-44ac-e8f8-b1184ce0ace9"
      },
      "source": [
        "# Check correct joint corpus\n",
        "assert len(train_tokenized_edit1_corpus) + len(dev_tokenized_edit1_corpus) == len(joint_tokenized_edit1_corpus)\n",
        "assert len(train_tokenized_edit2_corpus) + len(dev_tokenized_edit2_corpus) == len(joint_tokenized_edit2_corpus)\n",
        "\n",
        "# Should be the same\n",
        "print(train_tokenized_edit1_corpus[:3])\n",
        "print(joint_tokenized_edit1_corpus[:3])\n",
        "\n",
        "# Should be the same\n",
        "dev_set_start_id = len(train_tokenized_edit1_corpus)\n",
        "print(dev_tokenized_edit1_corpus[:3])\n",
        "print(joint_tokenized_edit1_corpus[dev_set_start_id:dev_set_start_id+3])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[['\"', 'Gene', 'Cernan', ',', 'Last', 'Dancer', 'on', 'the', 'Moon', ',', 'Dies', 'at', '82', '\"'], ['\"', 'I', \"'m\", 'done', '\"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas'], ['\"', 'I', \"'m\", 'done', '\"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas']]\n",
            "[['\"', 'Gene', 'Cernan', ',', 'Last', 'Dancer', 'on', 'the', 'Moon', ',', 'Dies', 'at', '82', '\"'], ['\"', 'I', \"'m\", 'done', '\"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas'], ['\"', 'I', \"'m\", 'done', '\"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas']]\n",
            "[['\"', 'Nutella', 'brownies', '\"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread'], ['\"', 'Nutella', 'brownies', '\"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread'], ['\"', 'Nutella', 'sales', '\"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread']]\n",
            "[['\"', 'Nutella', 'brownies', '\"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread'], ['\"', 'Nutella', 'brownies', '\"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread'], ['\"', 'Nutella', 'sales', '\"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread']]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8WymN6OTt7Gw"
      },
      "source": [
        "#### Learn Word2Vec Embeddings with Brown new corpuses"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xwhODl6kuGYY",
        "outputId": "61f30a53-1015-46c2-d383-e7a8d1dd6f12"
      },
      "source": [
        "# Download Brown(news) corpus\n",
        "from nltk.corpus import brown\n",
        "nltk.download('brown')\n",
        "\n",
        "# Function to train custom word2vec\n",
        "def train_word_embedding(\n",
        "    sentences, \n",
        "    embedding_dim=EMBEDDING_DIM, \n",
        "    window=5, \n",
        "    negative=15, \n",
        "    iter=10, \n",
        "    workers=multiprocessing.cpu_count()\n",
        "    ):\n",
        "  \n",
        "  return Word2Vec(sentences, size=embedding_dim, window=window, negative=negative, iter=iter, workers=workers)\n",
        "\n",
        "\n",
        "# Train with Brown(news) corpus\n",
        "brown_news_text = brown.words(categories='news')\n",
        "brown_sentences = brown.sents()\n",
        "brown_wv = train_word_embedding(brown_sentences).wv"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[nltk_data] Downloading package brown to /root/nltk_data...\n",
            "[nltk_data]   Unzipping corpora/brown.zip.\n"
          ],
          "name": "stdout"
        }
      ]
    },
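    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Optional qualitative check (illustrative, not required by the pipeline): query the freshly trained `brown_wv` vectors for a few nearest neighbours, using the same gensim 3.x API (`wv.vocab`, `most_similar`) assumed elsewhere in this notebook. The probe words are arbitrary and are skipped if they did not survive the default `min_count` filter."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Quick qualitative check of the custom Word2Vec embeddings\n",
        "for probe in ['money', 'city', 'war']:\n",
        "    if probe in brown_wv.vocab:\n",
        "        neighbours = brown_wv.most_similar(probe, topn=3)\n",
        "        print(probe, '->', [w for w, _ in neighbours])\n",
        "    else:\n",
        "        print(probe, 'not in the trained vocabulary')"
      ],
      "execution_count": null,
      "outputs": []
    },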
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iyOMIDgAM1dP"
      },
      "source": [
        "#### Create Embeddings"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Y3EYiQCdM1CL"
      },
      "source": [
        "def create_wvecs(vocab, tokenized_edit1_corpus, tokenized_edit2_corpus, word_vec):\n",
        "    \"\"\"\n",
        "    Create word embeddings:\n",
        "    uses custom learned embedding model\n",
        "    \"\"\"\n",
        "\n",
        "    wvecs = [] # word vectors\n",
        "    word2idx = [] # word2index\n",
        "    idx2word = []\n",
        "    index = 1\n",
        "    \n",
        "    for word in joint_vocab:\n",
        "      if word in word_vec.vocab:\n",
        "        vec = word_vec[word]\n",
        "        wvecs.append(vec)\n",
        "        word2idx.append((word, index))\n",
        "        idx2word.append((index, word))\n",
        "        index += 1\n",
        "\n",
        "    wvecs = np.array(wvecs)\n",
        "    word2idx = dict(word2idx)\n",
        "    idx2word = dict(idx2word)\n",
        "\n",
        "    # Feature 1: first version of edited title\n",
        "    vectorized_edit1_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in tokenized_edit1_corpus]\n",
        "    vectorized_edit1_seqs = [x if len(x) > 0 else [0] for x in vectorized_edit1_seqs] # avoid empty feature\n",
        "\n",
        "    # Feature 2: second version of edited title\n",
        "    vectorized_edit2_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in tokenized_edit2_corpus]\n",
        "    vectorized_edit2_seqs = [x if len(x) > 0 else [0] for x in vectorized_edit2_seqs] # avoid empty feature\n",
        "\n",
        "    return wvecs, word2idx, idx2word, vectorized_edit1_seqs, vectorized_edit2_seqs"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "jpBx19BkUBSC",
        "outputId": "00b11d55-8992-46f1-f514-939e3c383cc5"
      },
      "source": [
        "wvecs, word2idx, idx2word, vectorized_edit1_seqs, vectorized_edit2_seqs = create_wvecs(\n",
        "    joint_vocab,\n",
        "    joint_tokenized_edit1_corpus,\n",
        "    joint_tokenized_edit2_corpus,\n",
        "    brown_wv\n",
        ")\n",
        "\n",
        "test_wvecs, _, _, test_vectorized_edit1_seqs, test_vectorized_edit2_seqs = create_wvecs(\n",
        "    test_vocab,\n",
        "    test_tokenized_edit1_corpus,\n",
        "    test_tokenized_edit2_corpus,\n",
        "    brown_wv\n",
        ")\n",
        "\n",
        "# Check dimension match\n",
        "assert len(vectorized_edit1_seqs) == len(vectorized_edit2_seqs) == len(joint_tokenized_edit1_corpus) == len(joint_tokenized_edit2_corpus)\n",
        "assert len(test_vectorized_edit1_seqs) == len(test_vectorized_edit2_seqs) == len(test_tokenized_edit1_corpus) == len(test_tokenized_edit2_corpus)\n",
        "\n",
        "# Coverage report\n",
        "print(f\"[Train&Dev] Number of first edited title not captured by the embedding: {vectorized_edit1_seqs.count([0])}\")\n",
        "print(f\"[Train&Dev] Number of second edited title not captured by the embedding: {vectorized_edit2_seqs.count([0])}\")\n",
        "\n",
        "print(f\"[Test] Number of first edited title not captured by the embedding: {test_vectorized_edit1_seqs.count([0])}\")\n",
        "print(f\"[Test] Number of second edited title not captured by the embedding: {test_vectorized_edit2_seqs.count([0])}\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[Train&Dev] Number of first edited title not captured by the embedding: 13\n",
            "[Train&Dev] Number of second edited title not captured by the embedding: 9\n",
            "[Test] Number of first edited title not captured by the embedding: 0\n",
            "[Test] Number of second edited title not captured by the embedding: 0\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jYis4uYGW0e5"
      },
      "source": [
        "#### Padding"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1oHN7atmpCJi"
      },
      "source": [
        "# Used for collating our observations into minibatches:\n",
        "def collate_fn_padd(batch):\n",
        "    '''\n",
        "    We add padding to our minibatches and create tensors for our model\n",
        "    '''\n",
        "\n",
        "    batch_feature1 = [f1 for f1, f2, l in batch]\n",
        "    batch_feature2 = [f2 for f1, f2, l in batch]\n",
        "    batch_labels = [l for f1, f2, l in batch]\n",
        "\n",
        "    batch_features1_len = [len(f) for f in batch_feature1]\n",
        "    batch_features2_len = [len(f) for f in batch_feature2]\n",
        "    batch_feature1_and_2_len = batch_features1_len + batch_features2_len\n",
        "\n",
        "    seq_tensor_1 = torch.zeros((len(batch), max(batch_feature1_and_2_len))).long()\n",
        "    seq_tensor_2 = torch.zeros((len(batch), max(batch_feature1_and_2_len))).long()\n",
        "\n",
        "    for idx, (seq, seqlen) in enumerate(zip(batch_feature1, batch_features1_len)):\n",
        "        seq_tensor_1[idx, :seqlen] = torch.LongTensor(seq)\n",
        "    \n",
        "    for idx, (seq, seqlen) in enumerate(zip(batch_feature2, batch_features2_len)):\n",
        "        seq_tensor_2[idx, :seqlen] = torch.LongTensor(seq)\n",
        "\n",
        "    batch_labels = torch.LongTensor(batch_labels)\n",
        "\n",
        "    return seq_tensor_1, seq_tensor_2, batch_labels\n"
      ],
      "execution_count": null,
      "outputs": []
    },
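    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal sketch of what `collate_fn_padd` produces, using a hypothetical two-example batch with arbitrary token indices: both sequence tensors are zero-padded to the length of the longest sequence found in either feature."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Tiny illustrative batch: (edit1 indices, edit2 indices, label)\n",
        "_example_batch = [([1, 2, 3], [4, 5], 1), ([6], [7, 8, 9, 10], 2)]\n",
        "_s1, _s2, _y = collate_fn_padd(_example_batch)\n",
        "\n",
        "print(_s1)  # shape (2, 4), rows padded with 0\n",
        "print(_s2)  # shape (2, 4), rows padded with 0\n",
        "print(_y)   # tensor([1, 2])"
      ],
      "execution_count": null,
      "outputs": []
    },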
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "R12IslmZj5rN"
      },
      "source": [
        "### Datasets and Dataloaders"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "WrlA84R7j9Sn"
      },
      "source": [
        "# We create a Dataset so we can create minibatches\n",
        "class Task2Dataset(Dataset):\n",
        "\n",
        "    def __init__(self, edit1, edit2, labels):\n",
        "        self.x1 = edit1 # first edited version of title {sentence 1}\n",
        "        self.x2 = edit2 # second edited version of title {sentence 2}\n",
        "        self.y = labels\n",
        "\n",
        "    def __len__(self):\n",
        "        return len(self.y)\n",
        "\n",
        "    def __getitem__(self, item):\n",
        "        return self.x1[item], self.x2[item], self.y[item]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Th-urKhLj_7v",
        "outputId": "d50d0fb2-f0bc-4e19-cf5d-628ae6a117cd"
      },
      "source": [
        "# Combination of train and dev sets\n",
        "feature_1 = vectorized_edit1_seqs # first edited version\n",
        "feature_2 = vectorized_edit2_seqs # second edited version\n",
        "label = pd.concat([train_df['label'], dev_df['label']], ignore_index=True).to_numpy()\n",
        "\n",
        "assert len(feature_1) == len(feature_2) == len(label)\n",
        "\n",
        "# Train and dev datasets\n",
        "train_and_dev = Task2Dataset(feature_1, feature_2, label)\n",
        "\n",
        "train_examples = round(len(train_and_dev)*train_proportion)\n",
        "dev_examples = len(train_and_dev) - train_examples\n",
        "\n",
        "train_dataset, dev_dataset = random_split(\n",
        "    train_and_dev,\n",
        "    (train_examples,dev_examples)\n",
        "    )\n",
        "\n",
        "# Data Loaders\n",
        "train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)\n",
        "dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)\n",
        "\n",
        "print(\"Dataloaders created.\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Dataloaders created.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gvF44kqjj5y-"
      },
      "source": [
        "### Model Training and Evaluation"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "El4-CHGN6NJT"
      },
      "source": [
        "# Conversion from label to prob, needed for binary cross-entropy calculation\n",
        "def label_to_prob(label):\n",
        "    \"\"\"\n",
        "    Conversion between classes {0: same, 1: left sentence, 2: right sentence},\n",
        "    to probability between [0,1], {class 0 -> 0.5, class 1 -> 1,class 2 -> 0}.\n",
        "    \"\"\"\n",
        "\n",
        "    # Two setences are the same\n",
        "    label[label==0] = 0.5\n",
        "\n",
        "    # Right sentece is better (note no need to transform class 1)\n",
        "    label[label==2] = 0\n",
        "\n",
        "    return label.float()\n",
        "\n",
        "def prob_to_label(label):\n",
        "    \"\"\"\n",
        "    Inverse operation of label_to_prob\n",
        "    \"\"\"\n",
        "\n",
        "    # Probability to 0 or 1, with 0.5 as threshold\n",
        "    label = label.round()\n",
        "\n",
        "    # Right sentece is better (note no need to transform class 1)\n",
        "    label[label==0] = 2\n",
        "\n",
        "    return label"
      ],
      "execution_count": null,
      "outputs": []
    },
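    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Quick illustrative check (not part of the pipeline) of the label/probability conversion used with the BCE loss: class 0 (tie) maps to 0.5, class 1 to 1.0 and class 2 to 0.0, while `prob_to_label` thresholds a score at 0.5 and maps it back to class 1 or 2."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "_labels = torch.tensor([0, 1, 2, 1])\n",
        "print(label_to_prob(_labels))          # expected: tensor([0.5000, 1.0000, 0.0000, 1.0000])\n",
        "\n",
        "_scores = torch.tensor([0.9, 0.2, 0.6])\n",
        "print(prob_to_label(_scores))          # expected: tensor([1., 2., 1.])"
      ],
      "execution_count": null,
      "outputs": []
    },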
    {
      "cell_type": "code",
      "metadata": {
        "id": "u9mZ3V13pCJW"
      },
      "source": [
        "# We define our training loop\n",
        "def train(train_iter, dev_iter, model, number_epoch):\n",
        "    \"\"\"\n",
        "    Training loop for the model, which calls on eval to evaluate after each epoch\n",
        "    \"\"\"\n",
        "\n",
        "    print(\"Training model.\")\n",
        "\n",
        "    for epoch in range(1, number_epoch+1):\n",
        "        \n",
        "        model.train()\n",
        "        \n",
        "        epoch_loss = 0\n",
        "        epoch_correct = 0\n",
        "        no_observations = 0  # Observations used for training so far\n",
        "\n",
        "        for batch in train_iter:\n",
        "            # Get features and labels/targets\n",
        "            feature1, feature2, target = batch\n",
        "            feature1, feature2, target = feature1.to(device), feature2.to(device), target.to(device)            \n",
        "            target = label_to_prob(target)\n",
        "\n",
        "            # Forward pass (input both sentences)\n",
        "            prob = model(feature1, feature2) \n",
        "\n",
        "            # Calculate loss\n",
        "            optimizer.zero_grad()\n",
        "            loss = loss_fn(prob, target)\n",
        "\n",
        "            # Backward pass\n",
        "            loss.backward()\n",
        "            optimizer.step()\n",
        "            \n",
        "            # Record data\n",
        "            no_observations = no_observations + target.shape[0]\n",
        "            \n",
        "            pred_class = prob.detach().cpu().numpy().round()\n",
        "            correct, __ = model_performance(pred_class, target.detach().cpu().numpy())\n",
        "\n",
        "            epoch_loss += loss.item()*target.shape[0]\n",
        "            epoch_correct += correct\n",
        "\n",
        "        valid_loss, valid_acc, __, __ = eval(dev_iter, model)\n",
        "\n",
        "        epoch_loss, epoch_acc = epoch_loss / no_observations, epoch_correct / no_observations\n",
        "        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:e} | Train Accuracy: {epoch_acc:.2f} | \\\n",
        "        Val. Loss: {valid_loss:e} | Val. Accuracy: {valid_acc:.2f} |')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "h4TIRKC0pCJZ"
      },
      "source": [
        "# We evaluate performance on our dev set\n",
        "def eval(data_iter, model):\n",
        "    \"\"\"\n",
        "    Evaluating model performance on the dev set\n",
        "    \"\"\"\n",
        "    model.eval()\n",
        "    epoch_loss = 0\n",
        "    epoch_correct = 0\n",
        "    pred_all = []\n",
        "    trg_all = []\n",
        "    no_observations = 0\n",
        "\n",
        "    with torch.no_grad():\n",
        "        for batch in data_iter:\n",
        "  \n",
        "            # Get features and labels/targets\n",
        "            feature1, feature2, target = batch\n",
        "            feature1, feature2, target = feature1.to(device), feature2.to(device), target.to(device)\n",
        "            target = label_to_prob(target)\n",
        "\n",
        "            # Forward pass\n",
        "            prob = model(feature1, feature2) # first and second version title\n",
        "            \n",
        "            # Calculate loss\n",
        "            loss = loss_fn(prob, target)\n",
        "\n",
        "            # Calculate number of correct preds, and acc\n",
        "            pred_class = prob.detach().cpu().numpy().round()\n",
        "            prd, trg = pred_class, target.detach().cpu().numpy()\n",
        "            correct, __ = model_performance(prd, trg)\n",
        "\n",
        "            # Recording\n",
        "            no_observations = no_observations + target.shape[0]\n",
        "            epoch_loss += loss.item()*target.shape[0]\n",
        "            epoch_correct += correct\n",
        "            pred_all.extend(prob) # soft score\n",
        "            trg_all.extend(trg)\n",
        "        \n",
        "    return epoch_loss/no_observations, epoch_correct/no_observations, np.array(pred_all), np.array(trg_all)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iI3iXWnMpCJb"
      },
      "source": [
        "# How we print the model performance\n",
        "def model_performance(output, target, print_output=False):\n",
        "    \"\"\"\n",
        "    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8\n",
        "    \"\"\"\n",
        "\n",
        "    correct_answers = (output == target) | (target == 0.5)\n",
        "    correct = sum(correct_answers)\n",
        "    acc = np.true_divide(correct,len(output))\n",
        "\n",
        "    if print_output:\n",
        "        print(f'| Acc: {acc:.2f} ')\n",
        "\n",
        "    return correct, acc"
      ],
      "execution_count": null,
      "outputs": []
    },
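    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Illustrative check of `model_performance` on hand-picked arrays (not part of the pipeline): a prediction counts as correct if it matches the target or if the target is the 0.5 tie value."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "_pred = np.array([1.0, 0.0, 1.0, 0.0])\n",
        "_trg = np.array([1.0, 0.5, 0.0, 0.0])\n",
        "_correct, _acc = model_performance(_pred, _trg)\n",
        "print(_correct, _acc)  # expected: 3 0.75 (the tie at index 1 counts as correct)"
      ],
      "execution_count": null,
      "outputs": []
    },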
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_haGTYuJXVie"
      },
      "source": [
        "### Model"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "dbPXNY0IxpPd"
      },
      "source": [
        "class CNNBlock(nn.Module):\n",
        "    \"\"\"\n",
        "        CNN Block (conv-norm-nonlinearity)\n",
        "    \"\"\"\n",
        "    def __init__(\n",
        "            self, \n",
        "            in_channels, \n",
        "            out_channels, \n",
        "            kernel_size, \n",
        "            stride, \n",
        "            padding, \n",
        "            groups=1, # for depth-wise conv\n",
        "    ):\n",
        "        super(CNNBlock, self).__init__()\n",
        "        self.conv = nn.Sequential(\n",
        "            # Conv\n",
        "            nn.Conv1d(\n",
        "                in_channels,\n",
        "                out_channels,\n",
        "                kernel_size,\n",
        "                stride,\n",
        "                padding,\n",
        "                groups=groups,\n",
        "                bias=False,\n",
        "            ),\n",
        "            # Batch-Norm\n",
        "            nn.BatchNorm1d(out_channels),\n",
        "            # Non-linearity \n",
        "            nn.SiLU()\n",
        "        )\n",
        "\n",
        "    def forward(self, x):\n",
        "        return self.conv(x)\n",
        "\n",
        "class SqueezeExcitation(nn.Module):\n",
        "    \"\"\"\n",
        "        Squeeze Excitation Layer\n",
        "    \"\"\"\n",
        "    def __init__(self, in_channels, reduced_dim):\n",
        "        super(SqueezeExcitation, self).__init__()\n",
        "        self.se = nn.Sequential(\n",
        "            nn.AdaptiveAvgPool1d(1), # C x H -> C x 1\n",
        "            nn.Conv1d(in_channels, reduced_dim, 1),\n",
        "            nn.SiLU(),\n",
        "            nn.Conv1d(reduced_dim, in_channels, 1),\n",
        "            nn.Sigmoid(),\n",
        "        )\n",
        "\n",
        "    def forward(self, x):\n",
        "        return x * self.se(x)\n",
        "\n",
        "class InvertedResidualBlock(nn.Module):\n",
        "    \"\"\"\n",
        "        Inverted Residual Block (ResBlock with Squeeze Excitation)\n",
        "    \"\"\"\n",
        "    def __init__(\n",
        "            self,\n",
        "            in_channels,\n",
        "            out_channels,\n",
        "            kernel_size,\n",
        "            stride,\n",
        "            padding,\n",
        "            expand_ratio,\n",
        "            reduction=4, # for squeeze excitation\n",
        "            survival_prob=0.8, # for stochastic depth\n",
        "    ):\n",
        "        super(InvertedResidualBlock, self).__init__()\n",
        "        \n",
        "        self.survival_prob = survival_prob\n",
        "        self.use_residual = in_channels == out_channels and stride == 1\n",
        "        hidden_dim = in_channels * expand_ratio\n",
        "        self.expand = in_channels != hidden_dim\n",
        "        reduced_dim = int(in_channels / reduction)\n",
        "\n",
        "        if self.expand:\n",
        "            self.expand_conv = CNNBlock(\n",
        "                in_channels, \n",
        "                hidden_dim, \n",
        "                kernel_size=3, \n",
        "                stride=1, \n",
        "                padding=1,\n",
        "            )\n",
        "\n",
        "        self.conv = nn.Sequential(\n",
        "            CNNBlock(\n",
        "                hidden_dim, \n",
        "                hidden_dim, \n",
        "                kernel_size, \n",
        "                stride, \n",
        "                padding, \n",
        "                groups=hidden_dim,\n",
        "            ),\n",
        "            SqueezeExcitation(hidden_dim, reduced_dim),\n",
        "            nn.Conv1d(hidden_dim, out_channels, 1, bias=False),\n",
        "            nn.BatchNorm1d(out_channels),\n",
        "        )\n",
        "\n",
        "    def stochastic_depth(self, x):\n",
        "        if not self.training:\n",
        "            return x\n",
        "\n",
        "        binary_tensor = torch.rand(x.shape[0], 1, 1, device=x.device) < self.survival_prob\n",
        "        return torch.div(x, self.survival_prob) * binary_tensor\n",
        "\n",
        "    def forward(self, inputs):\n",
        "        x = self.expand_conv(inputs) if self.expand else inputs\n",
        "\n",
        "        if self.use_residual:\n",
        "            return self.stochastic_depth(self.conv(x)) + inputs\n",
        "        else:\n",
        "            return self.conv(x)\n",
        "\n",
        "\n",
        "class Net(nn.Module):\n",
        "    def __init__(self, embedding_dim, hidden_dim, vocab_size, dropout_rate=0.2):\n",
        "        super(Net, self).__init__()\n",
        "        \n",
        "        # Encoding dimension for each setence\n",
        "        last_channels = hidden_dim\n",
        "\n",
        "        # Layers (embbding - invertedResBlocks - pool - fc-with-sigmoid)\n",
        "        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)\n",
        "        self.convs = self.create_conv_layers(embedding_dim, last_channels)\n",
        "        self.pool = nn.AdaptiveAvgPool1d(1)\n",
        "        self.final = nn.Sequential(\n",
        "            nn.Dropout(dropout_rate),\n",
        "            nn.Linear(last_channels, 1),\n",
        "            nn.Sigmoid()\n",
        "        )\n",
        "\n",
        "    def create_conv_layers(self, in_channels, last_channels):\n",
        "        \n",
        "        # First convolution\n",
        "        channels = base_model[0][1] * 2\n",
        "        convs = [CNNBlock(in_channels, channels, 3, stride=2, padding=1)]\n",
        "        in_channels = channels\n",
        "\n",
        "        # Inverted residual block\n",
        "        for expand_ratio, out_channels, layers_repeats, stride, kernel_size in base_model:\n",
        "            for layer in range(layers_repeats):\n",
        "                convs.append(\n",
        "                    InvertedResidualBlock(\n",
        "                        in_channels,\n",
        "                        out_channels,\n",
        "                        expand_ratio=expand_ratio,\n",
        "                        stride = stride if layer == 0 else 1,\n",
        "                        kernel_size=kernel_size,\n",
        "                        padding=kernel_size//2, # if k=1:pad=0, k=3:pad=1, k=5:pad=2\n",
        "                    )\n",
        "                )\n",
        "                in_channels = out_channels\n",
        "\n",
        "        # Final convolution\n",
        "        convs.append(\n",
        "            CNNBlock(in_channels, last_channels, kernel_size=1, stride=1, padding=0)\n",
        "        )\n",
        "\n",
        "        return nn.Sequential(*convs)\n",
        "\n",
        "    def forward_one(self, sentence):\n",
        "        embedded = self.embedding(sentence)\n",
        "        embedded = embedded.permute(0, 2, 1)\n",
        "\n",
        "        out = self.convs(embedded)\n",
        "        out = self.pool(out)\n",
        "\n",
        "        return out.squeeze(2) # CHECK\n",
        "    \n",
        "    def forward(self, sentence1, sentence2):\n",
        "        \n",
        "        # High dimensional encoding of two sentences\n",
        "        out1 = self.forward_one(sentence1)\n",
        "        out2 = self.forward_one(sentence2)\n",
        "\n",
        "        # Difference of two encodings\n",
        "        diff = out1 - out2\n",
        "\n",
        "        return self.final(diff).squeeze(1)\n",
        "    \n",
        "    def predict(self, sentence1, sentence2):\n",
        "        \n",
        "        # Get embeddings\n",
        "        prob = self.forward(sentence1, sentence2)\n",
        "\n",
        "        # Output class during evaluation and inference\n",
        "        pred_class = (prob>0.5).float() # 0 or 1\n",
        "        pred_class[pred_class==0] = 2 # align with label\n",
        "\n",
        "        return pred_class"
      ],
      "execution_count": null,
      "outputs": []
    },
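    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A minimal shape check for the Siamese `Net` (illustrative only, with an arbitrary dummy vocabulary size of 50): a pair of padded index sequences of shape `(batch, seq_len)` should map to one probability per example, i.e. an output of shape `(batch,)`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {},
      "source": [
        "# Dummy forward pass on random token indices (runs on CPU with a short sequence length)\n",
        "_check_net = Net(EMBEDDING_DIM, ENCODING_DIM, vocab_size=50, dropout_rate=DROPOUT_RATE)\n",
        "_x1 = torch.randint(1, 50, (4, 16))\n",
        "_x2 = torch.randint(1, 50, (4, 16))\n",
        "print(_check_net(_x1, _x2).shape)    # expected: torch.Size([4])\n",
        "print(_check_net.predict(_x1, _x2))  # classes in {1., 2.}\n",
        "del _check_net"
      ],
      "execution_count": null,
      "outputs": []
    },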
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Kj_nWr90krq9"
      },
      "source": [
        "### Training"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3tvo0qlzpCJq",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "a98575e8-88af-4967-a9a9-98d65313c237"
      },
      "source": [
        "## Approach 1 code, using functions defined above:\n",
        "VOCAB_DIM = len(word2idx)\n",
        "\n",
        "model = Net(EMBEDDING_DIM, ENCODING_DIM, VOCAB_DIM, DROPOUT_RATE)\n",
        "print(\"Model initialised.\")\n",
        "\n",
        "model.to(device)\n",
        "# We provide the model with our embeddings\n",
        "model.embedding.weight.data.copy_(torch.from_numpy(wvecs))\n",
        "\n",
        "loss_fn = nn.BCELoss()\n",
        "loss_fn = loss_fn.to(device)\n",
        "\n",
        "optimizer = torch.optim.Adam(model.parameters())\n",
        "\n",
        "train(train_loader, dev_loader, model, epochs)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Model initialised.\n",
            "Training model.\n",
            "| Epoch: 01 | Train Loss: 7.076608e-01 | Train Accuracy: 0.52 |         Val. Loss: 6.895622e-01 | Val. Accuracy: 0.55 |\n",
            "| Epoch: 02 | Train Loss: 6.928791e-01 | Train Accuracy: 0.54 |         Val. Loss: 6.901884e-01 | Val. Accuracy: 0.55 |\n",
            "| Epoch: 03 | Train Loss: 6.920110e-01 | Train Accuracy: 0.55 |         Val. Loss: 6.885698e-01 | Val. Accuracy: 0.55 |\n",
            "| Epoch: 04 | Train Loss: 6.870542e-01 | Train Accuracy: 0.56 |         Val. Loss: 6.864181e-01 | Val. Accuracy: 0.54 |\n",
            "| Epoch: 05 | Train Loss: 6.563194e-01 | Train Accuracy: 0.60 |         Val. Loss: 6.894650e-01 | Val. Accuracy: 0.58 |\n",
            "| Epoch: 06 | Train Loss: 5.777405e-01 | Train Accuracy: 0.68 |         Val. Loss: 6.804225e-01 | Val. Accuracy: 0.60 |\n",
            "| Epoch: 07 | Train Loss: 5.031203e-01 | Train Accuracy: 0.73 |         Val. Loss: 7.164076e-01 | Val. Accuracy: 0.61 |\n",
            "| Epoch: 08 | Train Loss: 4.463048e-01 | Train Accuracy: 0.77 |         Val. Loss: 7.474166e-01 | Val. Accuracy: 0.61 |\n",
            "| Epoch: 09 | Train Loss: 3.973365e-01 | Train Accuracy: 0.80 |         Val. Loss: 8.197799e-01 | Val. Accuracy: 0.62 |\n",
            "| Epoch: 10 | Train Loss: 3.707153e-01 | Train Accuracy: 0.81 |         Val. Loss: 8.397309e-01 | Val. Accuracy: 0.61 |\n",
            "| Epoch: 11 | Train Loss: 3.451648e-01 | Train Accuracy: 0.83 |         Val. Loss: 8.833163e-01 | Val. Accuracy: 0.62 |\n",
            "| Epoch: 12 | Train Loss: 3.258516e-01 | Train Accuracy: 0.83 |         Val. Loss: 8.616007e-01 | Val. Accuracy: 0.62 |\n",
            "| Epoch: 13 | Train Loss: 3.123936e-01 | Train Accuracy: 0.85 |         Val. Loss: 8.363313e-01 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 14 | Train Loss: 2.966473e-01 | Train Accuracy: 0.85 |         Val. Loss: 8.780544e-01 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 15 | Train Loss: 2.863476e-01 | Train Accuracy: 0.86 |         Val. Loss: 8.470365e-01 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 16 | Train Loss: 2.804985e-01 | Train Accuracy: 0.86 |         Val. Loss: 9.182052e-01 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 17 | Train Loss: 2.712931e-01 | Train Accuracy: 0.86 |         Val. Loss: 9.184508e-01 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 18 | Train Loss: 2.620280e-01 | Train Accuracy: 0.87 |         Val. Loss: 1.027506e+00 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 19 | Train Loss: 2.610210e-01 | Train Accuracy: 0.87 |         Val. Loss: 9.613541e-01 | Val. Accuracy: 0.62 |\n",
            "| Epoch: 20 | Train Loss: 2.497481e-01 | Train Accuracy: 0.87 |         Val. Loss: 9.299627e-01 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 21 | Train Loss: 2.443211e-01 | Train Accuracy: 0.87 |         Val. Loss: 1.175043e+00 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 22 | Train Loss: 2.400835e-01 | Train Accuracy: 0.88 |         Val. Loss: 1.078343e+00 | Val. Accuracy: 0.64 |\n",
            "| Epoch: 23 | Train Loss: 2.380020e-01 | Train Accuracy: 0.88 |         Val. Loss: 1.006826e+00 | Val. Accuracy: 0.64 |\n",
            "| Epoch: 24 | Train Loss: 2.276892e-01 | Train Accuracy: 0.88 |         Val. Loss: 1.142239e+00 | Val. Accuracy: 0.64 |\n",
            "| Epoch: 25 | Train Loss: 2.242230e-01 | Train Accuracy: 0.89 |         Val. Loss: 1.006627e+00 | Val. Accuracy: 0.64 |\n",
            "| Epoch: 26 | Train Loss: 2.225512e-01 | Train Accuracy: 0.89 |         Val. Loss: 1.073522e+00 | Val. Accuracy: 0.64 |\n",
            "| Epoch: 27 | Train Loss: 2.198651e-01 | Train Accuracy: 0.89 |         Val. Loss: 1.063618e+00 | Val. Accuracy: 0.64 |\n",
            "| Epoch: 28 | Train Loss: 2.204948e-01 | Train Accuracy: 0.88 |         Val. Loss: 1.111030e+00 | Val. Accuracy: 0.64 |\n",
            "| Epoch: 29 | Train Loss: 2.046121e-01 | Train Accuracy: 0.89 |         Val. Loss: 1.292199e+00 | Val. Accuracy: 0.63 |\n",
            "| Epoch: 30 | Train Loss: 2.061463e-01 | Train Accuracy: 0.89 |         Val. Loss: 1.060312e+00 | Val. Accuracy: 0.64 |\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cxffbNcGxZ_m"
      },
      "source": [
        "### Testing"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tr-3RFYbxdr9"
      },
      "source": [
        "#### Prepare test datasets and dataloaders"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "75gXHmmbxgeU"
      },
      "source": [
        "# Test sets\n",
        "feature_1 = test_vectorized_edit1_seqs # first edited version\n",
        "feature_2 = test_vectorized_edit2_seqs # second edited version\n",
        "label = test_df['label'].to_numpy()\n",
        "\n",
        "assert len(feature_1) == len(feature_2) == len(label)\n",
        "\n",
        "# Test datasets\n",
        "test_dataset = Task2Dataset(feature_1, feature_2, label)\n",
        "\n",
        "# Data Loaders\n",
        "test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "YOcYPyzfxjvW",
        "outputId": "ff9d3a7a-66bf-4f35-a585-f837c84998f5"
      },
      "source": [
        "# Evaluation\n",
        "eval(test_loader, model)\n",
        "test_loss, test_acc, __, __ = eval(test_loader, model)\n",
        "print(f'| Test. Loss: {test_loss:e} | Test. Accuracy: {test_acc:.2f} |')"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "| Test. Loss: 1.463369e+00 | Test. Accuracy: 0.53 |\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1BAoHfeOpCJ3"
      },
      "source": [
        "#### Baseline for task 2"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "p0MwFSbSpCJ4",
        "outputId": "1de243d0-2f12-4dc6-ac64-333d4510637e"
      },
      "source": [
        "# Baseline for the task\n",
        "pred_baseline = torch.zeros(len(dev_y)) + 1  # 1 is most common class\n",
        "print(\"\\nBaseline performance:\")\n",
        "sse, mse = model_performance(pred_baseline, torch.tensor(dev_y.values), True)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "\n",
            "Baseline performance:\n",
            "| Acc: 0.45 \n"
          ],
          "name": "stdout"
        }
      ]
    }
  ]
}