What to do:
tokenisation
- auto tokeniser matched to each model checkpoint (sketch below)
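
A minimal sketch of the auto tokeniser step, assuming the Hugging Face `transformers` library and a dataset with a `text` column (the column name is an assumption):

```python
from transformers import AutoTokenizer

# AutoTokenizer loads the tokenizer that matches the checkpoint being tuned.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenise(batch):
    # "text" is an assumed column name; truncate/pad to the model's max length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

# With a datasets.Dataset: tokenised = dataset.map(tokenise, batched=True)
```
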
data augmentation / preprocessing
- synonym replacement (WordNet) (Azhara), sketch below
- remove all caps
- add text to samples
- back translation (Ella)
- feature space synonym replacement (Emily)
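
A sketch of the WordNet synonym replacement idea using NLTK directly; the `replace_fraction` parameter, function name, and whitespace tokenisation are assumptions, not the agreed implementation:

```python
import random

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def synonym_replace(sentence: str, replace_fraction: float = 0.2, seed: int = 0) -> str:
    """Swap a fraction of the words for a random WordNet synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    n_to_replace = max(1, int(len(words) * replace_fraction))
    for i in rng.sample(range(len(words)), min(n_to_replace, len(words))):
        lemmas = {l.name().replace("_", " ")
                  for syn in wordnet.synsets(words[i]) for l in syn.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = rng.choice(sorted(lemmas))
    return " ".join(words)

# synonym_replace("the movie was a complete disaster", replace_fraction=0.3)
```
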
hyperparameter tuning on roberta-base (Trainer sketch below)
- Azhara
  - learning rate [0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01]
  - optimizer [AdamW, Adafactor]
- Ella
  - early stopping (only required for longer epoch runs)
  - num_train_epochs [1, 5, 10, 15, 20]
- Emily
  - train_batch_size [8, 16, 32, 64, 128]
  - scheduler ["linear_schedule_with_warmup", "polynomial_decay_schedule_with_warmup", "constant_schedule_with_warmup"]
- cased or uncased
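
A hedged sketch of one point in the hyperparameter grid using `TrainingArguments`/`Trainer` from `transformers`; the dataset variables, output directory, and `num_labels=2` are placeholders, and the scheduler names above map onto `lr_scheduler_type` values (`linear`, `polynomial`, `constant_with_warmup`):

```python
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

args = TrainingArguments(
    output_dir="runs/roberta-base",          # placeholder path
    learning_rate=2e-4,                      # sweep over the learning rates above
    optim="adamw_torch",                     # or "adafactor"
    num_train_epochs=10,                     # sweep over [1, 5, 10, 15, 20]
    per_device_train_batch_size=32,          # sweep over [8, 16, 32, 64, 128]
    lr_scheduler_type="linear",              # "polynomial" / "constant_with_warmup"
    warmup_ratio=0.06,                       # assumed warmup fraction
    eval_strategy="epoch",                   # named evaluation_strategy on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,             # needed for early stopping
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,                  # assumed tokenised splits
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```
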
augmentation parameter tuning
- feature space synonym replacement (Emily): percentage of word embeddings replaced in BERT
  - what percentage of all sentences to augment
- synonym replacement (Azhara)
  - percentage of words replaced
- back translation (Ella), sketch below
  - which languages, and how many languages
  - other
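
For the back-translation parameters, a sketch of how the pivot language becomes a tunable choice, using Helsinki-NLP opus-mt checkpoints through the `transformers` translation pipeline (the language set and function name are assumptions):

```python
from transformers import pipeline

# Assumed pivot set; each entry is (en -> pivot, pivot -> en).
PIVOTS = {
    "de": ("Helsinki-NLP/opus-mt-en-de", "Helsinki-NLP/opus-mt-de-en"),
    "fr": ("Helsinki-NLP/opus-mt-en-fr", "Helsinki-NLP/opus-mt-fr-en"),
}

def back_translate(texts, lang="de"):
    """en -> pivot -> en; the round trip produces paraphrased training samples."""
    to_pivot = pipeline("translation", model=PIVOTS[lang][0])
    to_en = pipeline("translation", model=PIVOTS[lang][1])
    pivot_texts = [out["translation_text"] for out in to_pivot(texts)]
    return [out["translation_text"] for out in to_en(pivot_texts)]

# Sweep both the pivot language(s) and the fraction of the training set passed in.
# back_translate(["the plot made no sense at all"], lang="fr")
```
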
model ["facebook/bart-large-cnn", "distilroberta-base", "bert-base-cased"]
- do a larger/smaller version for each model above
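
A sketch of looping over the listed checkpoints plus larger/smaller siblings; the sibling names and `num_labels=2` are assumptions about which variants the comparison would use:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Listed checkpoints plus assumed larger/smaller siblings for the size comparison.
CANDIDATES = [
    "facebook/bart-large-cnn", "facebook/bart-base",
    "distilroberta-base", "roberta-base", "roberta-large",
    "bert-base-cased", "bert-large-cased",
]

for name in CANDIDATES:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)
    # ...re-run the tokenise / augment / Trainer steps above for each checkpoint...
```
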
evaluate
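
For the evaluation step, a minimal metrics hook for the Trainer; accuracy and macro-F1 are assumed metrics, and scikit-learn is assumed available:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    """Metrics hook for Trainer; reports accuracy and macro-F1 on the eval split."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1_macro": f1_score(labels, preds, average="macro"),
    }

# Pass Trainer(..., compute_metrics=compute_metrics) and compare trainer.evaluate()
# across the tokenisation, augmentation, hyperparameter, and model settings above.
```
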