Commit 11785997 authored by Joel Oksanen

System architecture and ADA app write up done

parent d72dd740
//
// ProductView.swift
// ADAbot
//
// Created by Joel Oksanen on 12.6.2020.
// Copyright © 2020 Joel Oksanen. All rights reserved.
//
import SwiftUI

struct ProductView: View {
    @ObservedObject var connectionManager: ConnectionManager

    var body: some View {
        VStack(spacing: 0) {
            // ProductInfoView is an assumed name for the product header subview
            // displayed above the chat; the original call here was self-referential.
            ProductInfoView(connectionManager: connectionManager)
                .zIndex(10)
            ChatView(connectionManager: connectionManager)
                .zIndex(0)
        }
        .edgesIgnoringSafeArea(.all)
        .background(Color.black)
    }
}
//
// SearchView.swift
// ADAbot
//
// Created by Joel Oksanen on 12.6.2020.
// Copyright © 2020 Joel Oksanen. All rights reserved.
//
import Foundation
\chapter{Feature-dependent sentiment analysis}
\label{chap:sa}
Unlike for our ontology extraction task, there already exist sufficiently accurate methods for our feature-dependent sentiment analysis purposes. In this paper, we have chosen to implement the state-of-the-art method of \textit{TD-BERT} proposed by Gao et al.\ \cite{RefWorks:doc:5ed3c3bbe4b0445db7f0a369}, trained on the \textit{SemEval-2014 Task 4} \cite{RefWorks:doc:5e6d25bee4b0b5553c7b4dab} data for \textit{Aspect Based Sentiment Analysis}. Since the data is domain-specific, we experiment with modifications to TD-BERT in order to improve its general domain performance.
\section{Availability of data}
As opposed to its sentence-level counterpart, feature-dependent sentiment analysis is a word-level task, which means that obtaining data for it is highly difficult. In practice, the sentiment towards each entity mention in the data has to be manually labelled by several human annotators, as the correct label is not always evident. Since training a neural network requires extensive amounts of training data, we do not have the resources to annotate such data ourselves and must instead rely on an external source.
The most well-known available data for the task comes from Task 4 of the SemEval-2014 evaluation series, which was based around feature-dependent sentiment analysis. The data\footnote{Available at http://alt.qcri.org/semeval2014/task4/index.php?id=data-and-tools} consists of manually annotated customer review sentences for laptops and restaurants, with the training and testing data counts shown in Table \ref{tab:semeval_data}. An example from the laptop dataset is shown in Table \ref{tab:semeval_data_example}.
\begin{table}[h]
\centering
{\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|c|c|c|}
\hline
\multirow{2}{*}{Dataset} & \multicolumn{2}{c|}{Number of instances} \\
\cline{2-3}
& training & testing \\
\hline \hline
laptops & 3045 & 800 \\
\hline
restaurants & 3041 & 800 \\
\hline
combined & 6086 & 1600 \\
\hline
\end{tabular}
}
\caption{SemEval-2014 Task 4 data counts}
\label{tab:semeval_data}
\end{table}
\begin{table}[H]
\centering
\begin{tabular}{|c|c|c|c|c|}
\hline
\texttt{sentence} & \texttt{term} & \texttt{polarity} & \texttt{from} & \texttt{to} \\
\hline \hline
\multirow{2}{*}{\makecell{"I charge it at night and skip taking the cord \\ with me because of the good battery life."}} & "cord" & "neutral" & 41 & 45 \\
\cline{2-5}
& "battery life" & "positive" & 74 & 86 \\
\hline
\end{tabular}
\caption{Example of SemEval-2014 Task 4 data}
\label{tab:semeval_data_example}
\end{table}
Each aspect term in the data is labelled with one of four labels according to the sentiment expressed towards it: \textit{neutral}, \textit{positive}, \textit{negative}, or \textit{conflict}. The \textit{conflict} label is given to terms such as \textit{screen} in \textit{small screen somewhat limiting but great for travel}, where both positive and negative sentiments are expressed. Only a small proportion of the training data is annotated with the \textit{conflict} label (45 for laptops), so it will be difficult for the classifier to learn to accurately predict this category.
This data suits our purposes as it is in the domain of user reviews; the laptop dataset is particularly relevant, since laptops are consumer products that can be found on Amazon. However, the dataset is quite small for training a neural network, so we also include the restaurant dataset in our evaluation. Furthermore, our experiments in Section \ref{sec:general_eval} showed that training a classifier on just two domains can drastically improve its performance in the general domain.
\section{Implementation}
In this section, we will detail our implementation of the TD-BERT architecture. In addition, we will propose an extension similar to the masking method used in Sections \ref{sec:feature_extraction} and \ref{sec:ontology_extraction} in hopes of improving the general domain performance of TD-BERT.
\subsection{TD-BERT}
The architecture of TD-BERT, shown in Figure \ref{fig:tdBERT}, is similar to the two BERT architectures we have seen so far, in that it max pools the output at the positions of the tokens we are interested in. In this case, we select the output at the position(s) of the feature for which we wish to obtain a sentiment. The pooling is followed by a linear classification layer and a softmax operator, which outputs the probabilities for the four possible cases: $p_{neut}$ for neutral sentiment, $p_{pos}$ for positive sentiment, $p_{neg}$ for negative sentiment, and $p_{conf}$ for conflicted sentiment.
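To make the pooling and classification steps concrete, the following minimal PyTorch sketch illustrates a TD-BERT-style classification head. It assumes the Hugging Face \texttt{transformers} library and illustrative names such as \texttt{target\_mask}, and is a sketch of the idea rather than our exact implementation.
\begin{verbatim}
import torch
import torch.nn as nn
from transformers import BertModel

class TDBertClassifier(nn.Module):
    def __init__(self, num_classes=4):  # neutral, positive, negative, conflict
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fc = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask, target_mask):
        # target_mask is 1 at the token position(s) of the target feature
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # max pool the BERT output vectors over the target positions only
        masked = hidden.masked_fill(target_mask.unsqueeze(-1) == 0, float("-inf"))
        pooled, _ = masked.max(dim=1)
        # a softmax over these logits yields (p_neut, p_pos, p_neg, p_conf)
        return self.fc(pooled)
\end{verbatim}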
\subsubsection{Masked TD-BERT}
Because we will be training the model with data for only laptop and restaurant reviews, we expect it to be biased towards those domains. In an attempt to alleviate this bias, we propose the addition of entity masking to TD-BERT. As with the BERT models used for ontology construction, we replace the tokens for the entity with \texttt{[MASK]} tokens. For example, in Figure \ref{fig:tdBERT}, the tokens \textit{battery} and \textit{life} would be masked.
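As an illustration, the sketch below shows how the target tokens could be replaced with \texttt{[MASK]} tokens before the sequence is passed to the model; the tokeniser calls assume the Hugging Face \texttt{transformers} API.
\begin{verbatim}
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = "I skip taking the cord with me because of the good battery life."
target = "battery life"

tokens = tokenizer.tokenize(sentence)
target_tokens = tokenizer.tokenize(target)
# locate the target and replace its tokens with [MASK] tokens
start = next(i for i in range(len(tokens))
             if tokens[i:i + len(target_tokens)] == target_tokens)
tokens[start:start + len(target_tokens)] = ["[MASK]"] * len(target_tokens)
input_ids = tokenizer.convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])
\end{verbatim}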
\begin{figure}[H]
\centering
\includegraphics[width=12cm]{images/sentiment_bert.png}
\caption{The TD-BERT architecture}
\label{fig:tdBERT}
\end{figure}
\subsubsection{Training the models}
We trained a total of four different models on the SemEval training data shown in Table \ref{tab:semeval_data}. We trained two models each for both unmasked and masked TD-BERT, one with the training data for just laptops, and one with the combined training dataset for both laptops and restaurants.
We used the same hyperparameters as Gao et al.: each model was trained for 6 epochs with a batch size of 32, using the Adam optimiser with standard cross entropy loss. The models were trained on an NVIDIA GeForce GTX 1080 GPU with 16GB RAM.
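Continuing from the earlier sketch, a hedged version of this training configuration could look as follows; \texttt{train\_set} is assumed to be a dataset yielding the tokenised inputs, target masks and gold labels, and the learning rate is an assumption, as it is not fixed by the hyperparameters listed here.
\begin{verbatim}
from torch.utils.data import DataLoader

model = TDBertClassifier().cuda()   # defined in the earlier sketch
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)  # lr is an assumption
criterion = nn.CrossEntropyLoss()
loader = DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(6):
    for input_ids, attention_mask, target_mask, labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids.cuda(), attention_mask.cuda(), target_mask.cuda())
        loss = criterion(logits, labels.cuda())
        loss.backward()
        optimizer.step()
\end{verbatim}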
\section{Evaluation}
We evaluate the models using the in-domain SemEval-2014 testing data shown in Table \ref{tab:semeval_data}, as well as a general-domain dataset of Amazon product reviews that we annotate ourselves.
\subsection{Annotation of Amazon sentiment analysis data}
We write a simple program, shown in Figure \ref{fig:entity_annotator}, to help us annotate Amazon review data for the evaluation of the sentiment analysis models. Using this program, we annotate a set of 100 Amazon reviews consisting of 20 reviews from each of five product categories: \textit{watches}, \textit{televisions}, \textit{necklaces}, \textit{stand mixers}, and \textit{video games}. We annotate a total of 285 entities across 481 sentences, obtaining the data shown in Table \ref{tab:amazon_sa_data}.
\begin{table}[h]
\centering
{\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|c|c|}
\hline
Sentiment & Number of instances \\
\hline \hline
neutral & 71 \\
\hline
positive & 133 \\
\hline
negative & 65 \\
\hline
conflict & 16 \\
\hline
\end{tabular}
}
\caption{Amazon sentiment analysis evaluation data counts}
\label{tab:amazon_sa_data}
\end{table}
\begin{figure}[h]
\centering
\includegraphics[width=12cm]{images/sentiment_annotator.png}
\caption{Entity annotator interface}
\label{fig:entity_annotator}
\end{figure}
\subsection{Results}
We evaluate each of the four models on both the SemEval-2014 testing data and the Amazon data, and display the accuracies and macro F1-scores in Table \ref{tab:sa_results}. With the SemEval-2014 data, we use the laptop testing data to evaluate the models trained on the laptop training data, and similarly for the combined data, in order to evaluate the in-domain performance of the models.
\begin{table}[h]
\centering
{\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|c|c||c|c|c|c|}
\hline
\multirow{2}{*}{Method} & \multirow{2}{*}{Training data} & \multicolumn{2}{c|}{SemEval-2014} & \multicolumn{2}{c|}{Amazon} \\
\cline{3-6}
& & Accuracy & Macro-F1 & Accuracy & Macro-F1 \\
\hline \hline
\multirow{2}{*}{Unmasked} & laptops & 77.22 & 56.44 & 74.65 & 52.85 \\
\cline{2-6}
& combined & 79.92 & 55.84 & 78.52 & 58.41 \\
\hline
\multirow{2}{*}{Masked} & laptops & 75.54 & 54.79 & 77.82 & 58.26 \\
\cline{2-6}
& combined & 79.92 & 56.20 & 75.00 & 55.54 \\
\hline
\end{tabular}
}
\caption{TD-BERT sentiment analysis evaluation results}
\label{tab:sa_results}
\end{table}
We note that the macro F1-scores are significantly lower than the accuracy scores in all cases. This is due to the limited training data available for the \textit{conflict} category: none of the models predicted any \textit{conflict} labels, so the F1-score for that category was zero.
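The effect can be seen in a small worked example (the labels below are purely illustrative, not our data): a classifier that never predicts \textit{conflict} can still achieve high accuracy, while the zero F1-score of the \textit{conflict} class drags the macro average down.
\begin{verbatim}
from sklearn.metrics import accuracy_score, f1_score

# illustrative labels only, roughly matching the class proportions above
y_true = ["pos"]*8 + ["neut"]*6 + ["neg"]*5 + ["conf"]
y_pred = ["pos"]*8 + ["neut"]*6 + ["neg"]*5 + ["pos"]  # 'conf' never predicted

print(accuracy_score(y_true, y_pred))             # 0.95
print(f1_score(y_true, y_pred, average="macro"))  # approx. 0.74: F1 for 'conf' is 0
\end{verbatim}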
\subsubsection{In-domain evaluation with SemEval-2014 data}
We note that for both methods, performance is significantly higher when trained and tested on the combined set of laptops and restaurants as opposed to just laptops. This is likely due to restaurant reviews being an easier domain for sentiment classification. As expected, the unmasked method performs better than the masked method when trained and evaluated in-domain on laptops. For the combined case, the two methods perform equally well in terms of accuracy, while the masked method obtains a slightly higher macro F1-score. This echoes the claim from Section \ref{sec:general_eval} that the in-domain advantage is lost even with a training set spanning just two domains.
\subsubsection{Out-of-domain evaluation with Amazon data}
The model trained on the combined data with the unmasked method appears to perform best in the out-of-domain evaluation, achieving an accuracy of 78.52, which is not far behind its accuracy of 79.92 for the in-domain evaluation. The accuracy is significantly higher than that of the unmasked model trained on the laptop data alone (74.65), which suggests that the model generalises well with just two training domains.
The accuracy of the masked model trained on the laptop domain (77.82) is also quite high, and significantly higher than that of the unmasked model trained on the same domain, indicating that for a single training domain, masking helps improve the generality of the model. However, the same is not true in the dual-domain setting, where the accuracy of the masked model drops to 75.00. This suggests that the model trained on two domains has already reached its general-domain optimum (as proposed in Section \ref{sec:general_eval}), and the masking therefore only decreases its performance by hiding information.
Based on these results, we decide to use the unmasked model trained on the combined data for the sentiment analysis task in our ADA system.
......@@ -3,6 +3,7 @@
We begin this chapter by detailing the methodology of the ADA proposed by Cocarascu et al.\ \cite{RefWorks:doc:5e08939de4b0912a82c3d46c}. We then evaluate the limitations of ADA in relation to Amazon reviews and suggest extensions to address them. Finally, we consider current research in the fields of feature-level sentiment analysis and conversational systems in order to establish a basis for our enhancements to the agent.
\section{Argumentative Dialogical Agent}
\label{sec:ADA_bg}
ADA is designed around a \textit{feature}-based conceptualisation of products. Rather than simply reporting the general sentiment of reviewers towards a product, ADA builds a more intricate understanding of \textit{why} a product is (dis-)liked by inspecting the reviewers' sentiments towards its features. For example, rather than stating that a movie is mostly well-received, ADA can discern that reviewers appreciated its acting and themes, while its cinematography was found subpar.
......@@ -30,7 +31,7 @@ In the following sections, we will go through the different aspects of ADA in mo
ADA is designed around a \textit{feature-based characterisation} of products:
\begin{definition}[Feature-based characterisation]
Let $\mathcal{P}$ be a given set of products, and $p \in \mathcal{P}$ be any product. A feature-based characterisation of $p$ is a set $\mathcal{F}$ of features with sub-features $\mathcal{F}' \subset \mathcal{F}$ such that each $f' \in \mathcal{F}'$ has a unique parent $p(f') \in \mathcal{F}$; for any $f \in \mathcal{F} \backslash \mathcal{F}'$, we define $p(f) = p$.
A feature-based characterisation of a product $p$ is a set $\mathcal{F}$ of features with sub-features $\mathcal{F}' \subset \mathcal{F}$ such that each $f' \in \mathcal{F}'$ has a unique parent $p(f') \in \mathcal{F}$; for any $f \in \mathcal{F} \backslash \mathcal{F}'$, we define $p(f) = p$.
\end{definition}
This feature-based characterisation can be \textit{predetermined}, i.e. obtained from metadata. For example, for a digital camera $p$, we could use metadata to obtain $\mathcal{F} = \{f_I,f_L,f_B\}$, where $f_I$ is \textit{image quality}, $f_L$ is \textit{lens}, and $f_B$ is \textit{battery}, as these are features that are present in most, if not all, digital cameras. Features can also be \textit{mined} from the reviews themselves using more sophisticated methods, such as using the semantic network ConceptNet\footnote{http://conceptnet.io/} to identify terms related to either the product or its known features, in order to obtain further sub-features. For our camera example, we could mine the sub-features $f_{L1}'$ for \textit{zoom} and $f_{L2}'$ for \textit{autofocus}. Figure \ref{fig:featurebasedrepresentation} shows the full feature-based representation for our running example.
......@@ -122,6 +123,7 @@ Online reviews are often brief: Amazon reviews have a median length of just 82 w
The augmentation populates the review aggregation by exploiting the parent relation of arguments and the fact that an argument can be seen as a sum of its sub-features. If an argument does not have a vote from a user $u$, but its sub-features have primarily positive (negative) votes from $u$, it will also gain a positive (negative) vote from $u$.
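A minimal sketch of this augmentation step is given below, assuming simple dictionary-based representations of the votes and of the sub-feature relation; the names are illustrative rather than those of the original implementation.
\begin{verbatim}
def augment_votes(votes, sub_features):
    # votes[u][arg] is +1 or -1; sub_features[arg] lists arg's sub-features
    for u, user_votes in votes.items():
        augmented = {}
        for arg, subs in sub_features.items():
            if arg in user_votes:
                continue  # u has voted on arg directly
            sub_votes = [user_votes[f] for f in subs if f in user_votes]
            if sub_votes and sum(sub_votes) != 0:
                # the majority polarity of u's votes on the sub-features
                # is propagated up to the parent argument
                augmented[arg] = 1 if sum(sub_votes) > 0 else -1
        user_votes.update(augmented)
    return votes
\end{verbatim}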
\subsection{Quantitative Bipolar Argumentation Framework}
\label{sec:QBAF}
ADA uses the review aggregation to generate a QBAF, which is a quadruple $\langle\mathcal{A},\mathcal{L}^-,\mathcal{L}^+,\tau\rangle$ consisting of a set $\mathcal{A}$ of arguments, binary child-parent relations $\mathcal{L}^-$ (attack) and $\mathcal{L}^+$ (support) on $\mathcal{A}$, and a total function $\tau : \mathcal{A} \to [0, 1]$ representing the \textit{base score} of the arguments. It is defined as follows:
......@@ -241,6 +243,7 @@ The QBAF in figure \ref{fig:QBAF} is annotated with DF-QuAD strengths for its ar
\subsection{Dialogical explanations}
\label{sec:dialogue}
\begin{figure}[b]
\begin{align*}
......@@ -270,7 +273,7 @@ ADA can generate dialogical explanations for arguments based on the extracted QB
\item else if $\alpha = p$ and $\sigma(\alpha) < 0.5$ and $\exists\beta\in\mathcal{L}^-(\alpha)\cup\mathcal{L}^+(\alpha)$ s.t. $\sigma(\beta) > 0$:
\begin{flalign*}
\mathcal{Q}(\alpha) &= \{\textrm{Why was $\alpha$ poorly rated?}\} &\\
\mathcal{X}(\alpha) &= \{\textrm{This product was poorly rated}\} + r_a^-(max(\mathcal{L}^-(\alpha))) + r_b^+(max(\mathcal{L}^+(\alpha)));
\mathcal{X}(\alpha) &= \{\textrm{This product was poorly rated}\} + r_a^-(max(\mathcal{L}^+(\alpha))) + r_b^+(max(\mathcal{L}^-(\alpha)));
\end{flalign*}
\item else if $\alpha \in \mathcal{F}$ and $\mathcal{V}^+(\alpha) > \mathcal{V}^-(\alpha)$ and $\exists\beta\in\mathcal{L}^-(\alpha)\cup\mathcal{L}^+(\alpha)$ s.t. $\sigma(\beta) > 0$:
\begin{flalign*}
......@@ -280,7 +283,7 @@ ADA can generate dialogical explanations for arguments based on the extracted QB
\item else if $\alpha \in \mathcal{F}$ and $\mathcal{V}^+(\alpha) < \mathcal{V}^-(\alpha)$ and $\exists\beta\in\mathcal{L}^-(\alpha)\cup\mathcal{L}^+(\alpha)$ s.t. $\sigma(\beta) > 0$:
\begin{flalign*}
\mathcal{Q}(\alpha) &= \{\textrm{Why was/were (the) $\alpha$ considered to be poor?}\} &\\
\mathcal{X}(\alpha) &= \{\textrm{(The) $\alpha$ was/were considered to be poor}\} + r_a^-(max(\mathcal{L}^-(\alpha))) + r_b^+(max(\mathcal{L}^+(\alpha)));
\mathcal{X}(\alpha) &= \{\textrm{(The) $\alpha$ was/were considered to be poor}\} + r_a^-(max(\mathcal{L}^+(\alpha))) + r_b^+(max(\mathcal{L}^-(\alpha)));
\end{flalign*}
\item else if $\mathcal{V}^+(\alpha) > \mathcal{V}^-(\alpha)$ and $\exists\beta\in\mathcal{L}^-(\alpha)\cup\mathcal{L}^+(\alpha)$ s.t. $\sigma(\beta) > 0$:
\begin{flalign*}
......@@ -419,9 +422,9 @@ Some methods have been proposed for \textit{automatic common-sense completion},
After we have extracted the features of a product, we wish to discern whether the opinions towards the features in reviews are positive or negative through sentiment analysis. Perhaps the main difficulty in feature-dependent sentiment analysis is to distinguish which opinions are acting on which arguments.
ADA attempts to tackle this issue by diving the review into phrases at specific keywords, such as the word \textit{but} in \textit{I liked the acting, but the cinematography was dreadful}, after which it assumes each phrase contains at most one sentiment. However, there are many cases where such a simple method will not work, like the example at the start of this section. This is particularly true for Amazon reviews where the language tends to be less formal compared to Rotten Tomatoes reviews written by professional critics.
ADA attempts to tackle this issue by dividing the review into phrases at specific keywords, such as the word \textit{but} in \textit{I liked the acting, but the cinematography was dreadful}, after which it assumes each phrase contains at most one sentiment. However, there are many cases where such a simple method will not work, like the example at the start of this section. This is particularly true for Amazon reviews where the language tends to be less formal compared to Rotten Tomatoes reviews written by professional critics.
More advanced methods using deep learning have been proposed in literature, although the task is deemed difficult and there is currently no dominating technique for this purpose \cite{RefWorks:doc:5e2b0d8de4b0711bafe4fba8}. Dong et al.\ \cite{RefWorks:doc:5e2e107ce4b0bc4691206e2e} proposed an \textit{adaptive recursive neural network} (AdaRNN) for target-dependent Twitter sentiment classification, which propagates the sentiments of words to the target by exploiting the context and the syntactic relationships between them. The results were compared with a re-implementation of \textit{SVM-dep} proposed by Jiang et al.\ \cite{RefWorks:doc:5e2e1e23e4b0e67b35d1c360}, which relies on target-dependent syntactic features in a SVM classifier instead of a neural network. The result for both methods are promising, and the domain of Twitter is similar to Amazon reviews in terms of formality.
Various machine learning methods have been proposed for feature-dependent sentiment analysis, although the task is deemed difficult and there is currently no dominating technique for this purpose \cite{RefWorks:doc:5e2b0d8de4b0711bafe4fba8}. More traditional approaches, such as \textit{SVM-dep} proposed by Jiang et al.\ \cite{RefWorks:doc:5e2e1e23e4b0e67b35d1c360}, use hand-crafted syntactic features alongside word embeddings in \textit{support-vector machine} (SVM) classifiers. More recently, various methods using deep learning have been proposed, such as \textit{TD-BERT} by Gao et al.\ \cite{RefWorks:doc:5ed3c3bbe4b0445db7f0a369}, which uses the BERT language model detailed in Section \ref{sec:BERT} to obtain state-of-the-art performance in the task.
However, both methods were trained and tested in the same domain including tweets about celebrities, companies and consumer electronics. The performance would likely drop substantially in a separate domain, as the sentiment polarity of a word can be highly dependent on context: for example the adjective \textit{hard} has a positive connotation when describing a protective case, but a negative connotation when describing an armchair.
......@@ -492,9 +495,9 @@ Although an argumentation dialogue has been defined for ADA, a conversational us
\subsection{Botplications}
Klopfenstein et al.\ define a \textit{Botplication} as 'an agent that is endowed with a conversational interface accessible through a messaging platform, which provides access to data, services, or enables the user to perform a specific task'. To extend ADA with such functionality, we would implement a messaging interface through which the user can request review explanations in line with the argumentation dialogue in Definition \ref{def:argdialogue}. Instead of demanding the user to type out a request, the interface might implement structured message forms, such as preset replies or an interactive list of available commands. The advantage of structured messages is that they constrain the conversation into a limited number of expected outcomes and assist the user in using the interface.
Klopfenstein et al.\ define a \textit{Botplication} as "an agent that is endowed with a conversational interface accessible through a messaging platform, which provides access to data, services, or enables the user to perform a specific task". To extend ADA with such functionality, we would implement a messaging interface through which the user can request review explanations in line with the argumentation dialogue in Definition \ref{def:argdialogue}. Instead of demanding the user to type out a request, the interface might implement structured message forms, such as preset replies or an interactive list of available commands. The advantage of structured messages is that they constrain the conversation into a limited number of expected outcomes and assist the user in using the interface.
The alternative to structured messages would be to use NLP to extract commands and intent from the user’s messages. However, Klopfenstein et al.\ argue that for a single bot, natural language should be avoided where possible, as 'going after AI is mostly excessive and counterproductive, when the same results can be obtained with simple text commands using a limited structured language'.
The alternative to structured messages would be to use NLP methods to extract commands and intent from the user’s messages. However, Klopfenstein et al.\ argue that for a single bot, natural language should be avoided where possible, as "going after AI is mostly excessive and counterproductive, when the same results can be obtained with simple text commands using a limited structured language".
% + memory
......@@ -507,6 +510,7 @@ Gunasekara et al.\ recently proposed the more general method of \textit{Quantise
Although implementing Quantised Dialog for the relatively simple domain of ADA would be excessive, we could combine some of its features with our existing semantic analysis methods developed for ADA's review aggregations. The user will ask semantically loaded questions about the product and its features, which should be covered by the same feature extraction and feature-dependent sentiment analysis methods as the review texts. If we can extract the semantics behind the queries and group them with one of the explanation requests of Definition \ref{def:argdialogue}, we can answer them with the predefined responses. However, this assumes that the user has sufficient information about the kind of responses the agent can produce.
\subsection{Evaluation using the RRIMS properties}
\label{sec:conv_eval}
Radlinski et al.\ propose that a conversational search system should satisfy the following five properties, termed the \textit{RRIMS properties}:
......
\chapter{Ontology extraction}
\label{chap:ontology}
\section{Exploration}
In this paper, we will limit the extraction of features to unigram, bigram, and trigram nouns, as the vast majority of terms for products and features fall into these categories. Although not strictly necessary, this will greatly help limit our search for features within the review texts with little effect on the recall of our model.
Since there does not exist a domain-independent common-sense ontology that would suit our purposes, we will extract the ontology ourselves from the review texts using NLP methods. In this paper, we will limit the extraction of features to unigram, bigram, and trigram nouns, as the vast majority of terms for products and features fall into these categories. Although not strictly necessary, this will greatly help limit our search for features within the review texts with little effect on the recall of our model.
%\subsection{ConceptNet}
%
......@@ -161,7 +160,7 @@ For each of the nouns, we take the mean of its $p_1$ votes, and accept it as an
\subsubsection{Training the model}
We trained the model on the feature extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 32. We used the Adam optimiser with standard cross entropy loss. The model was trained on a NVIDIA GeForce GTX 1080 GPU with 16GB RAM and took 3 hours and 16 minutes. The final accuracy and macro F1-score on the validation set were 0.897 and 0.894, respectively.
We trained the model on the feature extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 32. We used the Adam optimiser with standard cross entropy loss. The model was trained on a NVIDIA GeForce GTX 1080 GPU with 16GB RAM and took 3 hours and 16 minutes. The final accuracy and macro F1-score on the validation set were 89.70 and 89.39, respectively.
\subsection{Synonym extraction}
......@@ -213,7 +212,7 @@ Figure \ref{fig:relationBERT} shows the architecture of the BERT-based classifie
\subsubsection{Training the model}
We trained the model on the relation extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 16. We used the Adam optimiser with standard cross entropy loss. The model was trained on a NVIDIA GeForce GTX 1080 GPU with 16GB RAM and took 2 hours and 5 minutes. The final accuracy and macro F1-score on the validation set were 0.834 and 0.820, respectively.
We trained the model on the relation extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 16. We used the Adam optimiser with standard cross entropy loss. The model was trained on a NVIDIA GeForce GTX 1080 GPU with 16GB RAM and took 2 hours and 5 minutes. The final accuracy and macro F1-score on the validation set were 83.40 and 82.03, respectively.
\subsubsection{Ontology construction from votes}
......@@ -281,9 +280,10 @@ def get_tree(R, synsets):
\section{Evaluation}
In this section, we evaluate our ontology extraction method using human annotators both independently and against ontologies extracted using WordNet and COMeT. Furthermore, we independently evaluate the generalisation of the masked BERT method by experimenting with the number of the product categories used for its training.
In this section, we evaluate our ontology extraction method using human annotators both independently and against ontologies extracted using WordNet and COMeT. Furthermore, we independently evaluate the generality of the masked BERT models by experimenting with the number of the product categories used for their training.
\subsection{Ontology evaluation}
\label{sec:ontology_eval}
We evaluate five ontologies extracted for a variety of randomly selected products which were not included in the training data for the classifier: \textit{watches}, \textit{televisions}, \textit{necklaces}, \textit{stand mixers}, and \textit{video games}. For each product, we use 200,000 review texts as input to the ontology extractor, except for \textit{stand mixer}, for which we could only obtain 28,768 review texts due to it being a more niche category.
......@@ -309,7 +309,7 @@ We also extract ontologies for the five products from WordNet\footnote{http://wo
Since it is difficult to define a 'complete' ontology for a product, we concentrate our quantitative evaluation on the precision of the extracted ontologies. We will measure the precision of an ontology by the aggregated precision of its individual relations, which we will obtain by human annotation.
We present each of the 137 \textit{has feature} relations in the ontologies to 3 human annotators, and ask them to annotate the relation as either true of false in the context of Amazon products. The context is important, as features such as \textit{price} might not otherwise be considered a feature of a product. Using the majority vote among the annotators for each of the relations, we calculate the precision for each of the three methods and five products, and present the results in Table \ref{tab:ontology_precision} along with the total precision calculated for all 137 relations.
We present each of the 137 \textit{has feature} relations in the ontologies to 3 human annotators, and ask them to annotate the relation as either true or false in the context of Amazon products. The context is important, as features such as \textit{price} might not otherwise be considered a feature of a product. All three annotators agreed on 70.8\% of the relations. Using the majority vote among the annotators, we calculate the precision for each of the three methods and five products, and present the results in Table \ref{tab:ontology_precision} along with the total precision calculated for all 137 relations.
\begin{table}[H]
\centering
......@@ -317,32 +317,33 @@ Since it is difficult to define a 'complete' ontology for a product, we concentr
\hline
& watch & television & necklace & stand mixer & video game & total \\
\hline \hline
Our method & 0.885 & 0.864 & 0.700 & 0.882 & 1.000 & 0.848 \\
Our method & 88.46 & 86.36 & 70.00 & 88.24 & 100.00 & 84.78 \\
\hline
WordNet & 1.000 & 1.000 & 1.000 & 0.833 & - & 0.950 \\
WordNet & 100.00 & 100.00 & 100.00 & 83.33 & - & 95.00 \\
\hline
COMeT & 0.600 & 0.400 & 0.600 & 0.400 & 0.200 & 0.440 \\
COMeT & 60.00 & 40.00 & 60.00 & 40.00 & 20.00 & 44.00 \\
\hline
\end{tabular}
\caption{Precision scores for the three ontology extraction methods}
\label{tab:ontology_precision}
\end{table}
Our method achieves a total precision of 0.848, which is comparable to the in-domain validation accuracies of the entity and relation extractors (0.897 and 0.834, respectively). We note that the precision for the stand mixer ontology is equivalent to the rest of the ontologies despite using less data, which suggests that our method is effective even for products with relatively little review data.
Our method achieves a total precision of 84.78, which is comparable to the in-domain validation accuracies of the entity and relation extractors (89.70 and 83.40, respectively). We note that the precision for the stand mixer ontology is comparable to that of the other ontologies despite using less data, which suggests that our method is effective even for products with relatively little review data.
WordNet obtains the highest total precision score of 0.95, which is expected since its knowledge has been manually annotated by human annotators. However, WordNet extracted on average only 4 relations for each ontology, while our method extracted on average 18.4 relations. Part of this could be due its outdatedness, as its last release was nine years ago in June 2011\footnote{https://wordnet.princeton.edu/news-0}, although many of the products included in the comparison are quite timeless (\textit{necklace}, \textit{watch}). Furthermore, we observe that many of the terms extracted from WordNet, although correct, are scientific rather than common-sense (\textit{electron gun}, \textit{field magnet}), and therefore unsuitable for use in the Amazon review context.
WordNet obtains the highest total precision score of 95.00, which is expected since its knowledge has been manually annotated by humans. However, WordNet extracted on average only 4 relations for each ontology, while our method extracted on average 18.4 relations. Part of this could be due to its outdatedness, as its last release was nine years ago in June 2011\footnote{https://wordnet.princeton.edu/news-0}, although many of the products included in the comparison are quite timeless (\textit{necklace}, \textit{watch}). Furthermore, we observe that many of the terms extracted from WordNet, although correct, are scientific rather than common-sense (\textit{electron gun}, \textit{field magnet}), and therefore unsuitable for use in the Amazon review context.
The precision of our method is almost twice that of the top five terms extracted by COMeT. Most of the erroneous relations for COMeT are either remnants of the unstructured information on ConceptNet (\textit{game–effect of make you laugh}), or incorrectly categorised relations (\textit{watch–hand and wrist}).
In order to assess the reliability of agreement between the annotators, we calculate the \textit{Fleiss' kappa} measure $\kappa$, which measures the degree of agreement beyond that expected by chance. The value of $\kappa$ ranges from $-1$ to $1$, with the value $1$ signalling total agreement and $-1$ total disagreement. The kappa measure is generally thought to be a more reliable measure of inter-rater reliability than simple percent agreement, as it takes into account the probability of agreement by chance. We obtain $\kappa = 0.417$, which in a well-known study of the coefficient \cite{RefWorks:doc:5edfe5c1e4b064c22cd56d15} was interpreted to signify a weak level of agreement. This suggests that accurately determining \textit{feature of}-relations is difficult even for humans, which makes the high precision score obtained by our method all the more notable.
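For reference, both the raw agreement and $\kappa$ can be computed along the following lines; this is a minimal sketch assuming the \texttt{statsmodels} implementation of Fleiss' kappa and an array with one row per relation and one column per annotator (the labels shown are illustrative only).
\begin{verbatim}
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# one row per has-feature relation, one column per annotator (1 = true, 0 = false)
annotations = np.array([[1, 1, 1],
                        [1, 1, 0],
                        [0, 0, 0],
                        [1, 0, 0]])   # illustrative labels only

full_agreement = np.mean(annotations.min(1) == annotations.max(1))
table, _ = aggregate_raters(annotations)  # per-relation counts for each category
kappa = fleiss_kappa(table)
\end{verbatim}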
\subsection{Generalisation evaluation}
\label{sec:general_eval}
In this section, we evaluate the ability of our masked BERT method to generalise for the whole domain of Amazon products. In order to do this, we train the entity and relation classifiers with five different datasets $t_1 \dots t_5$ including review instances for one to five products as shown in Table \ref{tab:dataset_products}. We evaluate the models using an unseen dataset $w_e$, which we have labelled for a sixth domain (watches). In addition, we train entity and relation classifiers on a separate in-domain dataset $w_t$, which can be evaluated with $w_e$ to obtain a in-domain score. Each of the datasets contains 50,000 instances, and all models were trained with the hyperparameter values used in Sections \ref{sec:feature_extraction} and \ref{sec:ontology_extraction}.
In this section, we evaluate the ability of our masked BERT method to generalise for the whole domain of Amazon products. In order to do this, we train the entity and relation classifiers with five incremental datasets $t_1 \dots t_5$, where $t_n$ includes reviews from $n$ different categories of products. The products included in each dataset are shown in Table \ref{tab:dataset_products}. We evaluate the models using an unseen dataset $w_e$, which we have labelled for a sixth domain (watches). In addition, we train entity and relation classifiers on a separate in-domain dataset $w_t$, which can be evaluated with $w_e$ to obtain an in-domain score. Each of the datasets contains a total of 50,000 instances, and all models were trained with the hyperparameter values used in Sections \ref{sec:feature_extraction} and \ref{sec:ontology_extraction}.
\begin{table}[H]
\centering
\begin{tabular}{|c||c|}
\begin{tabular}{|c|c|}
\hline
Dataset & Products included \\
\hline \hline
......@@ -365,21 +366,21 @@ In this section, we evaluate the ability of our masked BERT method to generalise
\label{tab:dataset_products}
\end{table}
The accuracies for each of the classifiers trained on the five datasets $t_1 \dots t_5$ are plotted in Figure \ref{fig:n_accuracies}. The in-domain accuracies obtained by the classifiers trained using the dataset $w_t$ are plotted as dashed lines. The accuracies for both entity and relation extraction increase significantly when trained with reviews for two products instead of just one, after which the accuracies appear to stay somewhat constant around 0.05 units below the in-domain accuracies. The initial increase of accuracy with number of training products is expected, since a product-specific dataset will encourage the classifier to learn product-specific features. However, it is surprising to note that training the classifier with just two products (\textit{camera} and \textit{backpack}) is enough to raise its accuracy to its domain-independent optimum.
The evaluation accuracies for each of the classifiers trained on the five datasets $t_1 \dots t_5$ are plotted in Figure \ref{fig:n_accuracies}. The in-domain accuracies obtained by the classifiers trained using the dataset $w_t$ are plotted as dashed lines. The accuracies for both entity and relation extraction increase significantly when trained with reviews for two products instead of just one, after which the accuracies appear to stay somewhat constant around 5 percentage points below the in-domain accuracies. The initial increase of accuracy with the number of training products is expected, since a product-specific dataset will encourage the classifier to learn product-specific features. However, it is surprising that training the classifier with just two products (\textit{camera} and \textit{backpack}) is enough to raise its accuracy to its domain-independent optimum.
It appears that the domain-specific classifier has an advantage of around 0.05 units over the domain-independent classifier. This can be attributed to various domain-specific features the classifier can learn to take advantage of, such as domain-specific adjectives like \textit{swiss} or \textit{waterproof} for \textit{watch}.\footnote{It is interesting to note that the domain-independent optimum lies approximately halfway between the initial accuracy and the domain-specific accuracy. When the classifier is trained on several products, it 'forgets' its domain-specific knowledge, which results in worse accuracy in its own domain but better accuracy in the unseen domain, as its knowledge becomes more general. It makes intuitive sense that the point of context-independence lies in between the two context-specific opposites.}
It appears that the domain-specific classifier has an advantage of around 5 percentage points over the domain-independent classifier. This can be attributed to various domain-specific context features the classifier can learn to take advantage of, such as domain-specific adjectives like \textit{swiss} or \textit{waterproof} for \textit{watch}.\footnote{It is interesting to note that the domain-independent optimum lies approximately halfway between the initial accuracy and the domain-specific accuracy. When the classifier is trained on several products, it 'forgets' its domain-specific knowledge, which results in worse accuracy in its own domain but better accuracy in the unseen domain, as its knowledge becomes more general. It makes intuitive sense that the point of context-independence lies in between the two context-specific opposites.}
\begin{figure}[H]
\begin{figure}[h]
\centering
\begin{tikzpicture}
\begin{axis}[
xlabel={Training dataset $t_n$},
xlabel={Number of product categories in training data ($n$)},
ylabel={Evaluation accuracy on $w_e$},
xmin=1, xmax=5,
ymin=0.7, ymax=1.0,
ymin=70, ymax=100,
xtick={1,2,3,4,5},
ytick={0.7,0.75,0.8,0.85,0.9,0.95,1.0},
ytick={70,75,80,85,90,95,100},
legend pos=north west,
ymajorgrids=true,
grid style=dashed,
......@@ -390,7 +391,7 @@ It appears that the domain-specific classifier has an advantage of around 0.05 u
mark=triangle*,
]
coordinates {
(1,0.8051)(2,0.8426)(3,0.8333)(4,0.8389)(5,0.8435)
(1,80.51)(2,84.26)(3,83.33)(4,83.89)(5,84.35)
};
\addplot[
......@@ -398,7 +399,7 @@ It appears that the domain-specific classifier has an advantage of around 0.05 u
mark=diamond*,
]
coordinates {
(1,0.7159)(2,0.7381)(3,0.7375)(4,0.7428)(4,0.7428)(5,0.7352)
(1,71.59)(2,73.81)(3,73.75)(4,74.28)(4,74.28)(5,73.52)
};
\addplot [
......@@ -408,7 +409,7 @@ It appears that the domain-specific classifier has an advantage of around 0.05 u
samples=100,
color=blue!40!gray,
]
{0.9067};
{90.67};
\addplot [
line width=0.2mm,
......@@ -417,13 +418,13 @@ It appears that the domain-specific classifier has an advantage of around 0.05 u
samples=100,
color=orange,
]
{0.8046};
{80.46};
\legend{Entity extraction, Relation extraction}
\end{axis}
\end{tikzpicture}
\caption{Accuracies for masked BERT models trained with different numbers of products}
\caption{Accuracies on the $w_e$ test set for masked BERT models trained with datasets $t_1 \dots t_5$ containing $n$ categories of products. The accuracies of the models trained with in-domain data $w_t$ are shown as dashed lines.}
\label{fig:n_accuracies}
\end{figure}
......
......@@ -47,7 +47,7 @@ For both 1.\ and 2.\ we use BERT, a language model proposed by Devlin et al.\ \c
\centering
\begin{forest}
for tree = {l=2cm}
for tree = {draw, l=2.5cm}
[watch
[band
[links]
......@@ -55,15 +55,16 @@ For both 1.\ and 2.\ we use BERT, a language model proposed by Devlin et al.\ \c
[clasp]
]
[face
[hands]
[size]
[color]
[numbers]
[hands]
]
[price]
[quality]
[design]
[battery]
[\dots]
[\dots, draw=none, minimum width=2cm]
]
\end{forest}
......
......@@ -16,6 +16,7 @@
\usepackage{forest}
\usepackage{subcaption}
\usepackage{makecell}
\usepackage{multirow}
......
\chapter{System}
\chapter{ADA System}
In this chapter, we introduce the system architecture of our ADA implementation, as well as an interactive front-end application based on the \textit{Botplication} design principles proposed by Klopfenstein et al.\ \cite{RefWorks:doc:5e395ff6e4b02ac586d9a2c8}. We conclude by evaluating the performance and usability of our system.
\section{Architecture}
The full ADA system architecture diagram is shown in Figure \ref{fig:ada_architecture}. We will refer to this diagram in the following sections.
\begin{sidewaysfigure}[p]
\centering
\includegraphics[width=25cm]{images/ada_architecture.png}
\caption{The full ADA system architecture}
\label{fig:ada_architecture}
\end{sidewaysfigure}
\subsection{Back-end}
The back-end of the system consists of three distinct processes:
\begin{enumerate}
\item Ontology extraction,
\item QBAF extraction, and
\item a \textit{conversational agent} that handles interaction with the front-end application.
\end{enumerate}
In the diagram, data flows from left to right: the ontology for cameras is used to extract a QBAF for the Canon EOS 200D camera model, and the QBAF for the model is used by the conversational agent to communicate information about the model to the user via the front-end application.
While the conversational agent interacts with the user in real-time, the ontology and QBAF extraction processes run autonomously, and interact with the rest of the system only through their respective databases where they store the extracted data. This is the case for two reasons:
\begin{enumerate}
\item The ontology and QBAF extraction processes mine information from a large number of reviews, which takes a substantial amount of time. Running these processes in real time would therefore lead to unacceptable delays for the user, who expects fluid interaction with the system.
\item The extracted ontology and QBAF data is often used by multiple processes: the same ontology for cameras can be used to extract a QBAF for any camera model, and the same QBAF for a particular model can be used in conversations about the model with a number of different users. Therefore, extracting an ontology or QBAF from scratch each time would waste a lot of computing power.
\end{enumerate}
\subsubsection{Ontology extraction process}
The ontology extraction process uses Amazon user reviews to extract ontologies for product categories with the method detailed in Chapter \ref{chap:ontology}. As each ontology requires mining thousands of review texts (around 30,000 reviews was deemed sufficient in Section \ref{sec:ontology_eval}), extracting ontologies for each of the thousands of product categories on Amazon requires a lot of computing power. However, once an ontology for a product category such as cameras has been extracted, there is no need to update the ontology for a while assuming it is accurate. Although the composition or meaning of products can change over time, for most product categories, any changes usually happen slowly over the course of several years. Therefore, we propose that an ontology is initially extracted once for each product category, after which a background process can update the ontologies for categories with lots of new products when needed.
\subsubsection{QBAF extraction process}
The QBAF extraction process uses Amazon user reviews, the extracted ontologies, and the sentiment analysis method detailed in Chapter \ref{chap:sa} to extract QBAFs for product models with the method detailed in Section \ref{sec:ADA_bg}. As with ontology extraction, the initial extraction of QBAFs for all Amazon products is a costly process. However, unlike an ontology, the QBAF for a product must be updated with each new review, so that the explanations accurately reflect the entire set of reviews. Therefore, QBAF extraction requires a continuous background process. Note, however, that this background process is not as expensive as the initialisation of the QBAFs, as the computationally heavy tasks of feature detection and sentiment analysis only have to be performed for the new review instance.
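A hedged sketch of this incremental update is given below; \texttt{detect\_features}, \texttt{classify\_sentiment}, and \texttt{qbaf\_db} are illustrative stand-ins for the models of Chapters \ref{chap:ontology} and \ref{chap:sa} and the QBAF database, not the names used in our implementation.
\begin{verbatim}
def on_new_review(product_id, review_text, ontology, qbaf_db):
    qbaf = qbaf_db.load(product_id)
    # feature detection and sentiment analysis run only on the new review
    for feature, phrase in detect_features(review_text, ontology):
        polarity = classify_sentiment(phrase, feature)  # +1 or -1
        qbaf.add_vote(feature, polarity)
    qbaf.recompute_strengths()  # e.g. DF-QuAD over the updated aggregation
    qbaf_db.save(product_id, qbaf)
\end{verbatim}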
\subsubsection{Conversational agent}
The conversational agent is responsible for the dialogue between the user and system. The user can initiate a conversation by requesting information about a particular product on the front-end application. The conversational agent then loads the QBAF for the product from the QBAF database, which it uses to direct the conversation. For each query from the user, the agent returns both a response for the query, as well as options for follow-up questions about the entities mentioned in its response. By only allowing the user to select from a pre-defined set of query options, the agent guides the conversation so that it stays in the familiar domain of the argumentation dialogue detailed in Section \ref{sec:dialogue}.
Multiple users can use the system at the same time, so the agent processes each request on its own thread in order to minimise the response time. The agent keeps track of the conversations by assigning each user a unique identifier.
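The request handling could be organised roughly as in the sketch below, where \texttt{answer\_query} stands in for a step of the argumentation dialogue of Section \ref{sec:dialogue}; the names are illustrative and the actual server code may differ.
\begin{verbatim}
import threading

def handle_request(conn, request, sessions, qbaf_db):
    # conversations are tracked per user via a unique identifier
    session = sessions.setdefault(request.user_id, {})
    if request.product_id not in session:
        session[request.product_id] = qbaf_db.load(request.product_id)
    qbaf = session[request.product_id]
    response, follow_ups = answer_query(qbaf, request.query)
    conn.send({"response": response, "options": follow_ups})

def serve(conn, request, sessions, qbaf_db):
    # each request is processed on its own thread to minimise response time
    threading.Thread(target=handle_request,
                     args=(conn, request, sessions, qbaf_db)).start()
\end{verbatim}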
\subsection{iOS Botplication}
Based on the evaluation of conversational methods in Section \ref{sec:conv_eval},
we chose to implement a Botplication front-end for ADA, which interacts with the back-end via a network connection to the conversational agent. Figure \ref{fig:mixer_screenshots} shows three screenshots of the application.
The first screenshot (\ref{fig:mixer1}) shows a simple product selection screen, which the user can use to browse products. As our resources are limited, we cannot mine ontologies and QBAFs for the whole set of Amazon products, so the product selection screen displays only a small selection of products; a fully developed system would include product search functionality.
Once the user has selected a product they are interested in, ADA initiates the conversation by asking the user what they would like to know about the product. The user can tap on any of ADA's messages to reveal a set of possible questions determined by the argumentation dialogue. The subjects of these questions are the arguments mentioned in the message, which are highlighted in bold. For example, in \ref{fig:mixer2}, the user can ask about either the mixer, the motor, or the bowl, for which two query options are presented.
An example of a short conversation between the user and ADA is shown in \ref{fig:mixer3}. The conversation starts from a general view of the reviewers' sentiment towards the product, and from there delves deeper into more specific aspects of the product by utilising its ontology. Through this conversation, the user not only gains a better understanding of why the product was highly rated, but possibly also discovers more about the importance of various aspects of the product, which supports the \textit{user revealment} property introduced in Section \ref{sec:conv_eval}. To explore various aspects of the product, the user can at any point return to a previous point in the conversation by tapping on a previous message, which is one key advantage of the message-based Botplication design.
\begin{figure}[h]
\makebox[\textwidth][c]{
\begin{subfigure}{6cm}
\centering
\frame{\includegraphics[height=11cm]{images/selection_screen.png}}
\caption{Product selection screen}
\label{fig:mixer1}
\end{subfigure}%
\begin{subfigure}{6cm}
\centering
\frame{\includegraphics[height=11cm]{images/mixer_example_1.png}}
\caption{User controls}
\label{fig:mixer2}
\end{subfigure}%
\begin{subfigure}{6cm}
\centering
\frame{\includegraphics[height=11cm]{images/mixer_example_2.png}}
\caption{Example of a conversation}
\label{fig:mixer3}
\end{subfigure}
}
\caption{Screenshots of the ADA Botplication}
\label{fig:mixer_screenshots}
\end{figure}
\section{Evaluation}