## Approach 2: No pre-trained representations

### Coursework coding instructions (please also see full coursework spec)

Please choose if you want to do either Task 1 or Task 2. You should write your report about one task only.

For the task you choose you will need to do two approaches:
  - Approach 1, which can use pre-trained embeddings / models
  - Approach 2, which should not use any pre-trained embeddings or models

We should be able to run both approaches from the same colab file.

#### Running your code:
  - Your models should run automatically when running your colab file without further intervention
  - For each task you should automatically output the performance of both models
  - Your code should automatically download any libraries required

#### Structure of your code:
  - You are expected to use the 'train', 'eval' and 'model_performance' functions, although you may edit these as required
  - Otherwise there are no restrictions on what you can do in your code

#### Documentation:
  - You are expected to produce a .README file summarising how you have approached both tasks

#### Reproducibility:
  - Your .README file should explain how to replicate the different experiments mentioned in your report

Good luck! We are really looking forward to seeing your reports and your model code!

### Setup

In [None]:
# Environment variables
data_dir = "./data"
model_dir = "./models"

# Random Seed
SEED = 1

# Model Architecture
base_model = [
    # expand_ratio, channels, repeats, stride, kernel_size
    [1, 256, 1, 1, 3],
    [3, 300, 2, 2, 3],
    [3, 512, 2, 2, 3],
]

# Model Hyperparameters
BATCH_SIZE = 32
DROPOUT_RATE = 0.3
EMBEDDING_DIM = 100 # word vect embedding dimension
ENCODING_DIM = 1024 # hidden dimension of final pooling layer

# Number of Epochs
epochs = 30

# Proportion of training data for train compared to dev
train_proportion = 0.8
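
Each row of `base_model` describes one stage of inverted residual blocks, and the stride column controls how quickly the sequence is shortened. As a rough sketch, the helper below applies the standard `Conv1d` output-length formula (dilation 1) to the strided layers only. The names `conv1d_out_len` and `encoder_out_len` are illustrative and not part of the notebook; the stride-2 first convolution and `padding = kernel_size // 2` are taken from `Net.create_conv_layers` further down, and the length-preserving layers (the 1x1 convolutions and the stride-1 expansion convolution) are skipped.

```python
# Copy of the configuration above: expand_ratio, channels, repeats, stride, kernel_size
base_model = [
    [1, 256, 1, 1, 3],
    [3, 300, 2, 2, 3],
    [3, 512, 2, 2, 3],
]


def conv1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


def encoder_out_len(length):
    """Sequence length left after the convolutional stack (illustrative helper)."""
    # First convolution in Net.create_conv_layers: kernel 3, stride 2, padding 1
    length = conv1d_out_len(length, kernel_size=3, stride=2, padding=1)

    for _expand_ratio, _channels, repeats, stride, kernel_size in base_model:
        for layer in range(repeats):
            # Only the first block of each stage uses the configured stride
            layer_stride = stride if layer == 0 else 1
            length = conv1d_out_len(length, kernel_size, layer_stride, kernel_size // 2)

    return length


for n_tokens in (1, 10, 16):
    print(n_tokens, "->", encoder_out_len(n_tokens))
```

With three stride-2 layers the stack shortens a headline by roughly a factor of eight before the final average pooling, so typical headlines are reduced to one or two positions.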

In [None]:
# Imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import Dataset, random_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import torch.optim as optim
import codecs
import tqdm
import nltk
import re
from gensim.models import Word2Vec
import multiprocessing

In [None]:
# Set torch seed and device
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

use_cuda = torch.cuda.is_available()
device = torch.device("cuda:0" if use_cuda else "cpu")

In [None]:
# Download the task dataset
# if not os.path.exists('/data'):
!wget https://cs.rochester.edu/u/nhossain/semeval-2020-task-7-dataset.zip
!unzip /content/semeval-2020-task-7-dataset.zip
!rm semeval-2020-task-7-dataset.zip

--2021-03-03 11:10:15--  https://cs.rochester.edu/u/nhossain/semeval-2020-task-7-dataset.zip
Resolving cs.rochester.edu (cs.rochester.edu)... 192.5.53.208
Connecting to cs.rochester.edu (cs.rochester.edu)|192.5.53.208|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1621456 (1.5M) [application/zip]
Saving to: ‘semeval-2020-task-7-dataset.zip’


2021-03-03 11:10:17 (2.57 MB/s) - ‘semeval-2020-task-7-dataset.zip’ saved [1621456/1621456]

Archive:  /content/semeval-2020-task-7-dataset.zip
   creating: semeval-2020-task-7-dataset/
  inflating: semeval-2020-task-7-dataset/.DS_Store  
   creating: semeval-2020-task-7-dataset/subtask-1/
  inflating: semeval-2020-task-7-dataset/subtask-1/train_funlines.csv  
  inflating: semeval-2020-task-7-dataset/subtask-1/.DS_Store  
  inflating: semeval-2020-task-7-dataset/subtask-1/test.csv  
  inflating: semeval-2020-task-7-dataset/subtask-1/dev.csv  
 extracting: semeval-2020-task-7-dataset/subtask-1/baseline.zip  
  inflating: 

In [None]:
# Load data 
train_df = pd.read_csv('semeval-2020-task-7-dataset/subtask-2/train.csv')
dev_df = pd.read_csv('semeval-2020-task-7-dataset/subtask-2/dev.csv')
test_df = pd.read_csv('semeval-2020-task-7-dataset/subtask-2/test.csv')

### Pre-processing

#### Pre-process raw data (dataframes)

In [None]:
def extract_data(df):
    """Get edited_text_1, edited_text_2, original_text, and label from raw dataframe"""

    # Edited texts 1, x1: replace the <word/> span with the edit word
    original_1 = df['original1']
    edit_word_1 = df['edit1']
    edit_1 = pd.Series([re.sub(r'<.*/>', e, s) for s, e in zip(original_1, edit_word_1)])

    # Edited texts 2, x2
    original_2 = df['original2']
    edit_word_2 = df['edit2']
    edit_2 = pd.Series([re.sub(r'<.*/>', e, s) for s, e in zip(original_2, edit_word_2)])

    # Original texts, x3: strip the < and /> markers, keeping the original word
    # can be generated with either original_1 or original_2
    original = pd.Series([re.sub(r'<|/>', '', s) for s in original_1])

    # Label, y in {0, 1, 2}
    labels = df['label']

    return edit_1, edit_2, original, labels
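
To make the substitution concrete, here is a small standalone sketch of the two regular expressions on a single headline. The helper names `apply_edit` and `strip_markers` and the sample headline are illustrative assumptions, not part of the dataset loader; the function-style replacement is a defensive variation that keeps any backslash in the edit word from being read as a regex escape.

```python
import re


def apply_edit(original, edit_word):
    """Replace the <word/> span in a headline with the edit word."""
    # A callable replacement inserts edit_word literally
    return re.sub(r"<.*?/>", lambda _match: edit_word, original)


def strip_markers(original):
    """Remove the < and /> markers, keeping the original word."""
    return re.sub(r"<|/>", "", original)


# Illustrative headline in the dataset's markup style
headline = "Gene Cernan , Last <Man/> on the Moon , Dies at 82"

print(apply_edit(headline, "Dancer"))
print(strip_markers(headline))
```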

#### Create vocabulary

In [None]:
# To create our vocab
def create_vocab(data, vocabulary=True):
    """
    Tokenize each sentence on single spaces and, optionally, collect the vocabulary.

    Returns (vocabulary, tokenized_corpus); vocabulary is an empty list
    when vocabulary=False.
    """
    # Simplest possible tokenization: split on single spaces
    tokenized_corpus = [sentence.split(' ') for sentence in data]

    # Return tokenized corpus only
    if not vocabulary:
        return [], tokenized_corpus

    # Unique tokens (mostly words) in order of first appearance;
    # the set gives constant-time membership tests
    vocab, seen = [], set()

    for sentence in tokenized_corpus:
        for token in sentence:
            if token not in seen:
                seen.add(token)
                vocab.append(token)

    return vocab, tokenized_corpus

In [None]:
# Training, dev, and test data
train_edit_1, train_edit_2, train_original, _ = extract_data(train_df)
dev_edit_1, dev_edit_2, dev_original, _ = extract_data(dev_df)
test_edit_1, test_edit_2, test_original, _ = extract_data(test_df)

# Vocabs for training set
_, train_tokenized_edit1_corpus = create_vocab(train_edit_1, vocabulary=False)
_, train_tokenized_edit2_corpus = create_vocab(train_edit_2, vocabulary=False)
train_vocab, train_tokenized_corpus = create_vocab(pd.concat([train_edit_1, train_edit_2, train_original]))

print("Train vocab created.")

# Vocabs for dev set
_, dev_tokenized_edit1_corpus = create_vocab(dev_edit_1, vocabulary=False)
_, dev_tokenized_edit2_corpus = create_vocab(dev_edit_2, vocabulary=False)
dev_vocab, dev_tokenized_corpus = create_vocab(pd.concat([dev_edit_1, dev_edit_2, dev_original]))

print("Dev vocab created.")

# Vocabs for test set
_, test_tokenized_edit1_corpus = create_vocab(test_edit_1, vocabulary=False)
_, test_tokenized_edit2_corpus = create_vocab(test_edit_2, vocabulary=False)
test_vocab, test_tokenized_corpus = create_vocab(pd.concat([test_edit_1, test_edit_2, test_original]))

print("Test vocab created.")

# Creating joint vocab from dev and train
# (only the tokenized corpora are needed for the per-edit calls):
_, joint_tokenized_edit1_corpus = create_vocab(pd.concat([train_edit_1, dev_edit_1]), vocabulary=False)
_, joint_tokenized_edit2_corpus = create_vocab(pd.concat([train_edit_2, dev_edit_2]), vocabulary=False)
joint_vocab, joint_tokenized_corpus = create_vocab(
    pd.concat([train_edit_1, train_edit_2, train_original, dev_edit_1, dev_edit_2, dev_original])
)
print("Vocab created.")

Train vocab created.
Dev vocab created.
Test vocab created.
Vocab created.


In [None]:
# Check correct joint corpus
assert len(train_tokenized_edit1_corpus) + len(dev_tokenized_edit1_corpus) == len(joint_tokenized_edit1_corpus)
assert len(train_tokenized_edit2_corpus) + len(dev_tokenized_edit2_corpus) == len(joint_tokenized_edit2_corpus)

# Should be the same
print(train_tokenized_edit1_corpus[:3])
print(joint_tokenized_edit1_corpus[:3])

# Should be the same
dev_set_start_id = len(train_tokenized_edit1_corpus)
print(dev_tokenized_edit1_corpus[:3])
print(joint_tokenized_edit1_corpus[dev_set_start_id:dev_set_start_id+3])

[['"', 'Gene', 'Cernan', ',', 'Last', 'Dancer', 'on', 'the', 'Moon', ',', 'Dies', 'at', '82', '"'], ['"', 'I', "'m", 'done', '"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas'], ['"', 'I', "'m", 'done', '"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas']]
[['"', 'Gene', 'Cernan', ',', 'Last', 'Dancer', 'on', 'the', 'Moon', ',', 'Dies', 'at', '82', '"'], ['"', 'I', "'m", 'done', '"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas'], ['"', 'I', "'m", 'done', '"', ':', 'Fed', 'up', 'with', 'California', ',', 'some', 'vagrants', 'look', 'to', 'Texas']]
[['"', 'Nutella', 'brownies', '"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread'], ['"', 'Nutella', 'brownies', '"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread'], ['"', 'Nutella', 'sales', '"', 'erupt', 'in', 'France', 'over', 'discounted', 'chocolate', 'spread']]
[['"', 'Nut

#### Learn Word2Vec Embeddings with the Brown corpus

In [None]:
# Download the Brown corpus
from nltk.corpus import brown
nltk.download('brown')

# Function to train custom word2vec
# Note: `size` and `iter` are the gensim 3.x argument names
# (gensim 4 renamed them to `vector_size` and `epochs`)
def train_word_embedding(
    sentences, 
    embedding_dim=EMBEDDING_DIM, 
    window=5, 
    negative=15, 
    iter=10, 
    workers=multiprocessing.cpu_count()
    ):
  
  return Word2Vec(sentences, size=embedding_dim, window=window, negative=negative, iter=iter, workers=workers)


# Train with the Brown corpus.
# Note: brown.sents() with no category returns sentences from every category, not only 'news'.
brown_news_text = brown.words(categories='news')  # not used below
brown_sentences = brown.sents()
brown_wv = train_word_embedding(brown_sentences).wv

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


#### Create Embeddings

In [None]:
def create_wvecs(vocab, tokenized_edit1_corpus, tokenized_edit2_corpus, word_vec):
    """
    Build the embedding matrix and index the two edited-title corpora,
    using the custom learned embedding model.

    Index 0 is reserved for padding, so word indices start at 1 and
    row i of wvecs holds the vector of the word with index i + 1.
    Tokens missing from the embedding model are dropped.
    """

    wvecs = [] # word vectors
    word2idx = {} # word -> index
    idx2word = {} # index -> word
    index = 1
    
    for word in vocab:
      if word in word_vec.vocab: # gensim 3.x API
        wvecs.append(word_vec[word])
        word2idx[word] = index
        idx2word[index] = word
        index += 1

    wvecs = np.array(wvecs)

    # Feature 1: first version of edited title
    vectorized_edit1_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in tokenized_edit1_corpus]
    vectorized_edit1_seqs = [x if len(x) > 0 else [0] for x in vectorized_edit1_seqs] # avoid empty feature

    # Feature 2: second version of edited title
    vectorized_edit2_seqs = [[word2idx[tok] for tok in seq if tok in word2idx] for seq in tokenized_edit2_corpus]
    vectorized_edit2_seqs = [x if len(x) > 0 else [0] for x in vectorized_edit2_seqs] # avoid empty feature

    return wvecs, word2idx, idx2word, vectorized_edit1_seqs, vectorized_edit2_seqs

In [None]:
wvecs, word2idx, idx2word, vectorized_edit1_seqs, vectorized_edit2_seqs = create_wvecs(
    joint_vocab,
    joint_tokenized_edit1_corpus,
    joint_tokenized_edit2_corpus,
    brown_wv
)

# The test set is indexed with the joint (train + dev) vocabulary so that
# its indices match the rows of the embedding matrix used by the model
test_wvecs, _, _, test_vectorized_edit1_seqs, test_vectorized_edit2_seqs = create_wvecs(
    joint_vocab,
    test_tokenized_edit1_corpus,
    test_tokenized_edit2_corpus,
    brown_wv
)

# Check dimension match
assert len(vectorized_edit1_seqs) == len(vectorized_edit2_seqs) == len(joint_tokenized_edit1_corpus) == len(joint_tokenized_edit2_corpus)
assert len(test_vectorized_edit1_seqs) == len(test_vectorized_edit2_seqs) == len(test_tokenized_edit1_corpus) == len(test_tokenized_edit2_corpus)

# Coverage report
print(f"[Train&Dev] Number of first edited title not captured by the embedding: {vectorized_edit1_seqs.count([0])}")
print(f"[Train&Dev] Number of second edited title not captured by the embedding: {vectorized_edit2_seqs.count([0])}")

print(f"[Test] Number of first edited title not captured by the embedding: {test_vectorized_edit1_seqs.count([0])}")
print(f"[Test] Number of second edited title not captured by the embedding: {test_vectorized_edit2_seqs.count([0])}")

[Train&Dev] Number of first edited title not captured by the embedding: 13
[Train&Dev] Number of second edited title not captured by the embedding: 9
[Test] Number of first edited title not captured by the embedding: 0
[Test] Number of second edited title not captured by the embedding: 0
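
The indexing step can be illustrated on a toy vocabulary. This is a minimal sketch of the same idea, not the notebook's function: `build_index` and `vectorize` are hypothetical helper names, and a plain set of known words stands in for the trained word2vec vocabulary.

```python
def build_index(vocab, known_words):
    """Give each word covered by the embedding an index, starting at 1 (0 is padding)."""
    word2idx = {}
    for word in vocab:
        if word in known_words and word not in word2idx:
            word2idx[word] = len(word2idx) + 1
    return word2idx


def vectorize(tokens, word2idx):
    """Map tokens to indices, dropping unknown tokens; fall back to [0] if none remain."""
    ids = [word2idx[tok] for tok in tokens if tok in word2idx]
    return ids if ids else [0]


vocab = ["the", "moon", "Cernan", "dies"]
known_words = {"the", "moon", "dies"}  # stand-in for the embedding's vocabulary

word2idx = build_index(vocab, known_words)
print(word2idx)
print(vectorize(["Cernan", "dies", "the"], word2idx))
print(vectorize(["Cernan"], word2idx))
```

Because indices run from 1 to `len(word2idx)`, an embedding table that also holds the padding row needs `len(word2idx) + 1` rows.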


#### Padding

In [None]:
# Used for collating our observations into minibatches:
def collate_fn_padd(batch):
    '''
    We add padding to our minibatches and create tensors for our model
    '''

    batch_feature1 = [f1 for f1, f2, l in batch]
    batch_feature2 = [f2 for f1, f2, l in batch]
    batch_labels = [l for f1, f2, l in batch]

    batch_features1_len = [len(f) for f in batch_feature1]
    batch_features2_len = [len(f) for f in batch_feature2]
    batch_feature1_and_2_len = batch_features1_len + batch_features2_len

    seq_tensor_1 = torch.zeros((len(batch), max(batch_feature1_and_2_len))).long()
    seq_tensor_2 = torch.zeros((len(batch), max(batch_feature1_and_2_len))).long()

    for idx, (seq, seqlen) in enumerate(zip(batch_feature1, batch_features1_len)):
        seq_tensor_1[idx, :seqlen] = torch.LongTensor(seq)
    
    for idx, (seq, seqlen) in enumerate(zip(batch_feature2, batch_features2_len)):
        seq_tensor_2[idx, :seqlen] = torch.LongTensor(seq)

    batch_labels = torch.LongTensor(batch_labels)

    return seq_tensor_1, seq_tensor_2, batch_labels
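
The padding rule is easier to see without tensors. The following plain-Python sketch (the name `pad_pair_batch` is illustrative) pads both sides of a batch to the length of the longest sequence on either side, using 0 as the padding index, which is what `collate_fn_padd` does with `torch.zeros`.

```python
def pad_pair_batch(seqs_1, seqs_2, pad_id=0):
    """Right-pad two lists of index sequences to one shared maximum length."""
    max_len = max(len(seq) for seq in seqs_1 + seqs_2)

    def pad(seq):
        return list(seq) + [pad_id] * (max_len - len(seq))

    return [pad(seq) for seq in seqs_1], [pad(seq) for seq in seqs_2]


batch_1 = [[3, 1], [2]]
batch_2 = [[1, 2, 3, 4], [5]]

padded_1, padded_2 = pad_pair_batch(batch_1, batch_2)
print(padded_1)
print(padded_2)
```

Sharing one maximum length across both sides keeps the two tensors the same shape, so both sentences pass through the same convolution stack with matching output sizes.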


### Datasets and Dataloaders

In [None]:
# We create a Dataset so we can create minibatches
class Task2Dataset(Dataset):

    def __init__(self, edit1, edit2, labels):
        self.x1 = edit1 # first edited version of title {sentence 1}
        self.x2 = edit2 # second edited version of title {sentence 2}
        self.y = labels

    def __len__(self):
        return len(self.y)

    def __getitem__(self, item):
        return self.x1[item], self.x2[item], self.y[item]

In [None]:
# Combination of train and dev sets
feature_1 = vectorized_edit1_seqs # first edited version
feature_2 = vectorized_edit2_seqs # second edited version
label = pd.concat([train_df['label'], dev_df['label']], ignore_index=True).to_numpy()

assert len(feature_1) == len(feature_2) == len(label)

# Train and dev datasets
train_and_dev = Task2Dataset(feature_1, feature_2, label)

train_examples = round(len(train_and_dev)*train_proportion)
dev_examples = len(train_and_dev) - train_examples

train_dataset, dev_dataset = random_split(
    train_and_dev,
    (train_examples,dev_examples)
    )

# Data Loaders
train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=True, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)
dev_loader = torch.utils.data.DataLoader(dev_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)

print("Dataloaders created.")

Dataloaders created.


### Model Training and Evaluation

In [None]:
# Conversion from label to prob, needed for binary cross-entropy calculation
def label_to_prob(label):
    """
    Conversion between classes {0: same, 1: left sentence, 2: right sentence},
    to probability between [0,1], {class 0 -> 0.5, class 1 -> 1, class 2 -> 0}.
    """

    # Cast to float first: an integer tensor cannot store 0.5
    prob = label.float()

    # Two sentences are the same
    prob[label==0] = 0.5

    # Right sentence is better (note no need to transform class 1)
    prob[label==2] = 0

    return prob

def prob_to_label(label):
    """
    Approximate inverse of label_to_prob: probabilities above 0.5 map to
    class 1 and the rest to class 2. Class 0 (tie) is never predicted.
    """

    # Probability to 0 or 1, with 0.5 as threshold
    label = label.round()

    # Right sentence is better (note no need to transform class 1)
    label[label==0] = 2

    return label
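
Mapping a tie to a target of 0.5 works with binary cross-entropy because the loss for a soft target is minimised when the prediction equals the target. The sketch below checks this numerically in plain Python; `bce` is an illustrative scalar version of the loss, not the `nn.BCELoss` object used later.

```python
import math

# Target probability for each class, as in the label_to_prob docstring
LABEL_TO_TARGET = {0: 0.5, 1: 1.0, 2: 0.0}


def bce(pred, target):
    """Binary cross-entropy for one prediction in (0, 1) and a (possibly soft) target."""
    return -(target * math.log(pred) + (1 - target) * math.log(1 - pred))


tie_target = LABEL_TO_TARGET[0]
for pred in (0.1, 0.5, 0.9):
    print(f"pred={pred:.1f}  loss for a tie={bce(pred, tie_target):.4f}")
```

For a tie, the loss at a prediction of 0.5 is ln 2 (about 0.693), and it grows as the prediction moves towards either sentence. The same value is the loss of a model that always outputs 0.5, which is consistent with the training loss of roughly 0.69 in the first epochs of the log below.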

In [None]:
# We define our training loop
def train(train_iter, dev_iter, model, number_epoch):
    """
    Training loop for the model, which calls on eval to evaluate after each epoch
    """

    print("Training model.")

    for epoch in range(1, number_epoch+1):
        
        model.train()
        
        epoch_loss = 0
        epoch_correct = 0
        no_observations = 0  # Observations used for training so far

        for batch in train_iter:
            # Get features and labels/targets
            feature1, feature2, target = batch
            feature1, feature2, target = feature1.to(device), feature2.to(device), target.to(device)            
            target = label_to_prob(target)

            # Forward pass (input both sentences)
            prob = model(feature1, feature2) 

            # Calculate loss
            optimizer.zero_grad()
            loss = loss_fn(prob, target)

            # Backward pass
            loss.backward()
            optimizer.step()
            
            # Record data
            no_observations = no_observations + target.shape[0]
            
            pred_class = prob.detach().cpu().numpy().round()
            correct, __ = model_performance(pred_class, target.detach().cpu().numpy())

            epoch_loss += loss.item()*target.shape[0]
            epoch_correct += correct

        valid_loss, valid_acc, __, __ = eval(dev_iter, model)

        epoch_loss, epoch_acc = epoch_loss / no_observations, epoch_correct / no_observations
        print(f'| Epoch: {epoch:02} | Train Loss: {epoch_loss:e} | Train Accuracy: {epoch_acc:.2f} | \
        Val. Loss: {valid_loss:e} | Val. Accuracy: {valid_acc:.2f} |')

In [None]:
# We evaluate performance on a held-out set
def eval(data_iter, model):
    """
    Evaluate model performance on the given data (dev or test set).

    Returns mean loss, accuracy, soft scores, and targets.
    """
    model.eval()
    epoch_loss = 0
    epoch_correct = 0
    pred_all = []
    trg_all = []
    no_observations = 0

    with torch.no_grad():
        for batch in data_iter:
  
            # Get features and labels/targets
            feature1, feature2, target = batch
            feature1, feature2, target = feature1.to(device), feature2.to(device), target.to(device)
            target = label_to_prob(target)

            # Forward pass
            prob = model(feature1, feature2) # first and second version title
            
            # Calculate loss
            loss = loss_fn(prob, target)

            # Calculate number of correct preds, and acc
            pred_class = prob.detach().cpu().numpy().round()
            prd, trg = pred_class, target.detach().cpu().numpy()
            correct, __ = model_performance(prd, trg)

            # Recording
            no_observations = no_observations + target.shape[0]
            epoch_loss += loss.item()*target.shape[0]
            epoch_correct += correct
            pred_all.extend(prob.detach().cpu().numpy()) # soft score, moved to CPU so np.array works on GPU runs
            trg_all.extend(trg)
        
    return epoch_loss/no_observations, epoch_correct/no_observations, np.array(pred_all), np.array(trg_all)

In [None]:
# How we print the model performance
def model_performance(output, target, print_output=False):
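    """
    Returns (correct, acc): the number of correct predictions in the batch and
    the accuracy, i.e. if you get 8/10 right, this returns (8, 0.8).
    Targets of 0.5 (ties) are always counted as correct.
    """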

    correct_answers = (output == target) | (target == 0.5)
    correct = sum(correct_answers)
    acc = np.true_divide(correct,len(output))

    if print_output:
        print(f'| Acc: {acc:.2f} ')

    return correct, acc
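
Counting every tie as correct raises the reported accuracy relative to scoring only the decided pairs. The plain-Python sketch below shows the difference on four made-up predictions; `accuracy_with_ties` and `accuracy_without_ties` are illustrative names, and the numbers are not results from this notebook.

```python
def accuracy_with_ties(preds, targets):
    """Same rule as model_performance: a target of 0.5 always counts as correct."""
    correct = sum(1 for p, t in zip(preds, targets) if p == t or t == 0.5)
    return correct, correct / len(preds)


def accuracy_without_ties(preds, targets):
    """Stricter variant: ignore tied pairs and score only the decided ones."""
    decided = [(p, t) for p, t in zip(preds, targets) if t != 0.5]
    correct = sum(1 for p, t in decided if p == t)
    return correct, correct / len(decided)


preds = [1.0, 0.0, 1.0, 0.0]    # rounded model outputs (made up)
targets = [1.0, 1.0, 0.5, 0.5]  # two decided pairs, two ties

print(accuracy_with_ties(preds, targets))
print(accuracy_without_ties(preds, targets))
```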

### Model

In [None]:
class CNNBlock(nn.Module):
    """
        CNN Block (conv-norm-nonlinearity)
    """
    def __init__(
            self, 
            in_channels, 
            out_channels, 
            kernel_size, 
            stride, 
            padding, 
            groups=1, # for depth-wise conv
    ):
        super(CNNBlock, self).__init__()
        self.conv = nn.Sequential(
            # Conv
            nn.Conv1d(
                in_channels,
                out_channels,
                kernel_size,
                stride,
                padding,
                groups=groups,
                bias=False,
            ),
            # Batch-Norm
            nn.BatchNorm1d(out_channels),
            # Non-linearity 
            nn.SiLU()
        )

    def forward(self, x):
        return self.conv(x)

class SqueezeExcitation(nn.Module):
    """
        Squeeze Excitation Layer
    """
    def __init__(self, in_channels, reduced_dim):
        super(SqueezeExcitation, self).__init__()
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), # C x L -> C x 1
            nn.Conv1d(in_channels, reduced_dim, 1),
            nn.SiLU(),
            nn.Conv1d(reduced_dim, in_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.se(x)

class InvertedResidualBlock(nn.Module):
    """
        Inverted Residual Block (ResBlock with Squeeze Excitation)
    """
    def __init__(
            self,
            in_channels,
            out_channels,
            kernel_size,
            stride,
            padding,
            expand_ratio,
            reduction=4, # for squeeze excitation
            survival_prob=0.8, # for stochastic depth
    ):
        super(InvertedResidualBlock, self).__init__()
        
        self.survival_prob = survival_prob
        self.use_residual = in_channels == out_channels and stride == 1
        hidden_dim = in_channels * expand_ratio
        self.expand = in_channels != hidden_dim
        reduced_dim = int(in_channels / reduction)

        if self.expand:
            self.expand_conv = CNNBlock(
                in_channels, 
                hidden_dim, 
                kernel_size=3, 
                stride=1, 
                padding=1,
            )

        self.conv = nn.Sequential(
            CNNBlock(
                hidden_dim, 
                hidden_dim, 
                kernel_size, 
                stride, 
                padding, 
                groups=hidden_dim,
            ),
            SqueezeExcitation(hidden_dim, reduced_dim),
            nn.Conv1d(hidden_dim, out_channels, 1, bias=False),
            nn.BatchNorm1d(out_channels),
        )

    def stochastic_depth(self, x):
        if not self.training:
            return x

        binary_tensor = torch.rand(x.shape[0], 1, 1, device=x.device) < self.survival_prob
        return torch.div(x, self.survival_prob) * binary_tensor

    def forward(self, inputs):
        x = self.expand_conv(inputs) if self.expand else inputs

        if self.use_residual:
            return self.stochastic_depth(self.conv(x)) + inputs
        else:
            return self.conv(x)


class Net(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, dropout_rate=0.2):
        super(Net, self).__init__()
        
        # Encoding dimension for each sentence
        last_channels = hidden_dim

        # Layers (embedding - invertedResBlocks - pool - fc-with-sigmoid)
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.convs = self.create_conv_layers(embedding_dim, last_channels)
        self.pool = nn.AdaptiveAvgPool1d(1)
        self.final = nn.Sequential(
            nn.Dropout(dropout_rate),
            nn.Linear(last_channels, 1),
            nn.Sigmoid()
        )

    def create_conv_layers(self, in_channels, last_channels):
        
        # First convolution
        channels = base_model[0][1] * 2
        convs = [CNNBlock(in_channels, channels, 3, stride=2, padding=1)]
        in_channels = channels

        # Inverted residual block
        for expand_ratio, out_channels, layers_repeats, stride, kernel_size in base_model:
            for layer in range(layers_repeats):
                convs.append(
                    InvertedResidualBlock(
                        in_channels,
                        out_channels,
                        expand_ratio=expand_ratio,
                        stride = stride if layer == 0 else 1,
                        kernel_size=kernel_size,
                        padding=kernel_size//2, # if k=1:pad=0, k=3:pad=1, k=5:pad=2
                    )
                )
                in_channels = out_channels

        # Final convolution
        convs.append(
            CNNBlock(in_channels, last_channels, kernel_size=1, stride=1, padding=0)
        )

        return nn.Sequential(*convs)

    def forward_one(self, sentence):
        embedded = self.embedding(sentence)
        embedded = embedded.permute(0, 2, 1)

        out = self.convs(embedded)
        out = self.pool(out)

        return out.squeeze(2) # (batch, channels, 1) -> (batch, channels)
    
    def forward(self, sentence1, sentence2):
        
        # High dimensional encoding of two sentences
        out1 = self.forward_one(sentence1)
        out2 = self.forward_one(sentence2)

        # Difference of two encodings
        diff = out1 - out2

        return self.final(diff).squeeze(1)
    
    def predict(self, sentence1, sentence2):
        
        # Get embeddings
        prob = self.forward(sentence1, sentence2)

        # Output class during evaluation and inference
        pred_class = (prob>0.5).float() # 0 or 1
        pred_class[pred_class==0] = 2 # align with label

        return pred_class
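
The `stochastic_depth` method drops the residual branch for a random subset of samples during training and rescales the survivors by `1 / survival_prob`, so the expected branch output is unchanged. The plain-Python sketch below checks that property on scalars; it is an illustration with a seeded `random.Random`, not the tensor implementation above.

```python
import random


def stochastic_depth(branch_outputs, survival_prob, training, rng):
    """Per-sample stochastic depth on a list of scalar branch outputs."""
    if not training:
        # At evaluation time the branch is always kept, unscaled
        return list(branch_outputs)

    result = []
    for value in branch_outputs:
        keep = rng.random() < survival_prob
        # Survivors are scaled up so the expectation matches evaluation mode
        result.append(value / survival_prob if keep else 0.0)
    return result


rng = random.Random(0)
samples = stochastic_depth([1.0] * 100_000, survival_prob=0.8, training=True, rng=rng)
mean = sum(samples) / len(samples)

print(f"mean branch output during training: {mean:.3f}")
print(stochastic_depth([1.0, 2.0], survival_prob=0.8, training=False, rng=rng))
```

The mean stays close to 1.0 even though about a fifth of the samples are zeroed, which is why no rescaling is needed when switching to `model.eval()`.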

### Training

In [None]:
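## Approach 2 code, using functions defined above:
VOCAB_DIM = len(word2idx) + 1 # +1 for the padding index 0

model = Net(EMBEDDING_DIM, ENCODING_DIM, VOCAB_DIM, DROPOUT_RATE)
print("Model initialised.")

model.to(device)
# We provide the model with our embeddings.
# Word indices start at 1, so the vectors go into rows 1..N;
# row 0 stays the zero vector used for padding.
model.embedding.weight.data[1:].copy_(torch.from_numpy(wvecs))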

loss_fn = nn.BCELoss()
loss_fn = loss_fn.to(device)

optimizer = torch.optim.Adam(model.parameters())

train(train_loader, dev_loader, model, epochs)

Model initialised.
Training model.
| Epoch: 01 | Train Loss: 7.076608e-01 | Train Accuracy: 0.52 |         Val. Loss: 6.895622e-01 | Val. Accuracy: 0.55 |
| Epoch: 02 | Train Loss: 6.928791e-01 | Train Accuracy: 0.54 |         Val. Loss: 6.901884e-01 | Val. Accuracy: 0.55 |
| Epoch: 03 | Train Loss: 6.920110e-01 | Train Accuracy: 0.55 |         Val. Loss: 6.885698e-01 | Val. Accuracy: 0.55 |
| Epoch: 04 | Train Loss: 6.870542e-01 | Train Accuracy: 0.56 |         Val. Loss: 6.864181e-01 | Val. Accuracy: 0.54 |
| Epoch: 05 | Train Loss: 6.563194e-01 | Train Accuracy: 0.60 |         Val. Loss: 6.894650e-01 | Val. Accuracy: 0.58 |
| Epoch: 06 | Train Loss: 5.777405e-01 | Train Accuracy: 0.68 |         Val. Loss: 6.804225e-01 | Val. Accuracy: 0.60 |
| Epoch: 07 | Train Loss: 5.031203e-01 | Train Accuracy: 0.73 |         Val. Loss: 7.164076e-01 | Val. Accuracy: 0.61 |
| Epoch: 08 | Train Loss: 4.463048e-01 | Train Accuracy: 0.77 |         Val. Loss: 7.474166e-01 | Val. Accuracy: 0.61 |
| Epo

### Testing

#### Prepare test datasets and dataloaders

In [None]:
# Test sets
feature_1 = test_vectorized_edit1_seqs # first edited version
feature_2 = test_vectorized_edit2_seqs # second edited version
label = test_df['label'].to_numpy()

assert len(feature_1) == len(feature_2) == len(label)

# Test datasets
test_dataset = Task2Dataset(feature_1, feature_2, label)

# Data Loaders
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_fn_padd)

In [None]:
# Evaluation
test_loss, test_acc, __, __ = eval(test_loader, model)
print(f'| Test. Loss: {test_loss:e} | Test. Accuracy: {test_acc:.2f} |')

| Test. Loss: 1.463369e+00 | Test. Accuracy: 0.53 |


#### Baseline for task 2

In [None]:
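# Baseline for the task: always predict class 1
# NOTE: dev_y is not defined elsewhere in this notebook; it is assumed here
# to be the labels of the official dev split.
dev_y = dev_df['label']
pred_baseline = torch.zeros(len(dev_y)) + 1  # 1 is most common class
print("\nBaseline performance:")
correct, acc = model_performance(pred_baseline, torch.tensor(dev_y.values), True)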


Baseline performance:
| Acc: 0.45 
