Commit 12d26ff1 authored by Joel Oksanen

Finished ontology implementation

parent 74ef519c
\appendix
\chapter{Extracted ontologies}
\label{sec:ontology_appendix}
@inproceedings{RefWorks:doc:5edca760e4b0ef3565a5f38d,
author={Tomas Mikolov and Ilya Sutskever and Kai Chen and Greg S. Corrado and Jeff Dean},
year={2013},
title={Distributed representations of words and phrases and their compositionality},
booktitle={Advances in neural information processing systems},
pages={3111-3119}
}
@inproceedings{RefWorks:doc:5edc9ecbe4b03b813c4d4381,
author={Jianmo Ni and Jiacheng Li and Julian McAuley},
year={2019},
title={Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages={188-197}
}
@article{RefWorks:doc:5edbafdde4b0482c79eb8d95,
author={Tomas Mikolov and Kai Chen and Greg Corrado and Jeffrey Dean},
year={2013},
......
\section{Implementation}
Our method of ontology extraction is a multi-step pipeline that extracts the product ontology from review texts using both hand-crafted grammatical features and two BERT-based models trained with \textit{distantly supervised learning}. The first model performs \textit{named-entity recognition} (NER) of the various features of the product, while the second performs \textit{relation extraction} (RE) to extract sub-feature relations between the recognised features. In addition, we use a \textit{Word2Vec} \cite{RefWorks:doc:5edbafdde4b0482c79eb8d95} model to extract word vectors, which are used to group the features into sets of synonyms, or \textit{synsets}, using a method proposed by Leeuwenberg et al.\ \cite{RefWorks:doc:5eaebe76e4b098fe9e0217c2}. The pipeline is structured as follows:
\begin{enumerate}
\item Noun extraction
\item Feature extraction
\item Synonym extraction
\item Ontology extraction
\end{enumerate}
In this section, we will first detail the annotation method used to obtain the training data for the two BERT-based models, after which we will go through the different pipeline steps in detail.
\subsection{Annotation of training data for masked BERT}
\label{sec:annotation}
Annotating training data that would be representative of the whole set of Amazon products would be nearly impossible due to the sheer number of different product categories on Amazon. However, in review texts, certain grammatical constructs stay the same regardless of the product. Take for example the two review sentences:
\begin{center}
\label{tab:re_training_data}
\end{table}
We used the program to obtain training data for a variety of five randomly selected products: digital cameras, backpacks, laptops, guitars, and cardigans. The categorised review data was obtained from a public repository\footnote{https://nijianmo.github.io/amazon/index.html} by Ni et al.\ \cite{RefWorks:doc:5edc9ecbe4b03b813c4d4381}. For each of these products, we annotated the 200 most common nouns, as we observed that most of the relevant features of a product are included within this range. After resampling to balance out the number of instances for each of the classes, we obtained the training data shown in Table \ref{tab:training_data}.
\begin{table}[h]
\centering
\subsection{Noun extraction}
\label{sec:noun_extraction}
The first step of our ontology extraction method is to extract the most commonly appearing nouns in the review texts, which will be candidates for features in the following step.
The review data is divided into review texts, many of which are several sentences long, so we first split the texts into sentences. In this paper, we treat each sentence as an individual unit of information, independent of other sentences in the same review text. We then tokenise the sentences and use an out-of-the-box implementation of a method by Mikolov et al.\ \cite{RefWorks:doc:5edca760e4b0ef3565a5f38d} to join common co-occurrences of tokens into bigrams and trigrams. This step is crucial for detecting multi-word nouns such as \textit{operating system}, which is an important feature of \textit{computer}. After this, we use a part-of-speech tagger to select the nouns among the tokens and count the number of occurrences of each noun. Finally, as in the annotation method detailed in Section \ref{sec:annotation}, we select the 200 most common nouns and pass them on to the feature extraction step.
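To make this step concrete, the following sketch shows one possible implementation; the choice of NLTK for tokenisation and tagging, gensim's \texttt{Phrases} for the bigram and trigram joining, and the frequency thresholds are illustrative assumptions rather than a description of our exact implementation.
\begin{lstlisting}
# Sketch of the noun extraction step (libraries and thresholds assumed).
from collections import Counter
import nltk
from gensim.models.phrases import Phrases, Phraser

def extract_common_nouns(review_texts, top_k=200):
    # 1. Split review texts into sentences and tokenise them.
    sentences = [nltk.word_tokenize(sent.lower())
                 for text in review_texts
                 for sent in nltk.sent_tokenize(text)]

    # 2. Join common co-occurrences of tokens into bigrams, then trigrams
    #    (Mikolov et al.'s phrase detection as implemented in gensim).
    bigram = Phraser(Phrases(sentences, min_count=10, threshold=10.0))
    trigram = Phraser(Phrases(bigram[sentences], min_count=10, threshold=10.0))
    sentences = [trigram[bigram[s]] for s in sentences]

    # 3. Keep tokens tagged as nouns and count their occurrences.
    counts = Counter()
    for tokens in sentences:
        for token, tag in nltk.pos_tag(tokens):
            if tag.startswith('NN'):
                counts[token] += 1

    # 4. The top_k most common nouns are passed on to feature extraction.
    return [noun for noun, _ in counts.most_common(top_k)]
\end{lstlisting}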
\subsection{Feature extraction}
For the feature extraction step, we obtain review sentences that mention exactly one of the nouns obtained in the previous step, and pass the sentences through a BERT-based classifier to obtain votes for whether the noun is an argument or not. In the end, we aggregate these votes for each of the nouns to obtain a list of extracted arguments.
\subsubsection{BERT for feature extraction}
Figure \ref{fig:entityBERT} shows the architecture of the BERT-based classifier used for feature extraction. The classifier takes as input a review sentence, as well as the noun which we wish to classify as an argument or a non-argument. The tokenisation step masks the tokens associated with the noun (in the example of Figure \ref{fig:entityBERT}, 'operating' and 'system') with the \texttt{[MASK]} token. The tokens are passed through the transformer network, and the output used for classification is taken from the positions of the masked tokens. The input to the linear classification layer is always of the dimension of a single BERT hidden layer output, so if there are several masked tokens, a max-pooling operation is performed on their outputs. The linear layer is followed by a softmax operation, which outputs the probabilities $p_0$ and $p_1$ of the masked noun being a non-argument or an argument, respectively.
For each of the nouns, we take the mean of its $p_1$ votes, and accept it as an argument if the mean is above 0.65, a hyperparameter tuned through validation to strike a good balance between precision and recall of the feature extraction. Using the raw output probabilities from the network rather than binary votes allows us to bias the aggregate towards more certain predictions of the model.
\begin{figure}[h]
\centering
\includegraphics[width=12cm]{images/entity_bert.png}
\caption{BERT for feature extraction}
\label{fig:entityBERT}
\end{figure}
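As an illustration of the architecture in Figure \ref{fig:entityBERT}, a minimal PyTorch sketch could look as follows; it assumes the pre-trained \texttt{bert-base-uncased} model from the HuggingFace \texttt{transformers} library, and the class and function names are our own.
\begin{lstlisting}
# Minimal sketch of the masked-BERT feature classifier (libraries assumed).
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class EntityBert(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # linear layer from a single hidden vector to the two classes
        self.classifier = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, mask_positions):
        # mask_positions: boolean tensor marking the [MASK]ed noun tokens
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # max-pool the outputs at the masked positions into one vector per sentence
        masked = hidden.masked_fill(~mask_positions.unsqueeze(-1), float('-inf'))
        pooled, _ = masked.max(dim=1)
        return self.classifier(pooled)  # softmax over these logits gives (p0, p1)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def encode(sentence, noun):
    # replace each word of the noun with [MASK] before tokenisation
    masked_sentence = sentence.replace(
        noun, ' '.join([tokenizer.mask_token] * len(noun.split())))
    enc = tokenizer(masked_sentence, return_tensors='pt', truncation=True)
    mask_positions = enc['input_ids'] == tokenizer.mask_token_id
    return enc['input_ids'], enc['attention_mask'], mask_positions
\end{lstlisting}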
\subsubsection{Training the model}
We trained the model on the feature extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 32. We used the Adam optimiser with standard cross-entropy loss. The model was trained on an NVIDIA GeForce GTX 1080 GPU with 16GB RAM, and training took 3 hours and 16 minutes. The final accuracy and macro F1-score on the validation set were 0.897 and 0.894, respectively.
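A stripped-down version of this training procedure, continuing the sketch above, could look as follows; the learning rate and the construction of \texttt{train\_dataset} are assumptions not specified in the text.
\begin{lstlisting}
# Simplified training loop (learning rate and dataset construction assumed).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = EntityBert().cuda()                     # classifier from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)   # learning rate assumed
loss_fn = nn.CrossEntropyLoss()
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # assumed Dataset

model.train()
for epoch in range(3):
    for input_ids, attention_mask, mask_positions, labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids.cuda(), attention_mask.cuda(), mask_positions.cuda())
        loss = loss_fn(logits, labels.cuda())
        loss.backward()
        optimizer.step()
\end{lstlisting}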
\subsection{Synonym extraction}
Reviewers can refer to the same argument using many different terms; for example the argument \textit{laptop} can be referred to with the terms \textit{computer}, \textit{device}, and \textit{product}. In order to construct an ontology tree, we must be able to group all of these terms under the same node. However, terms like \textit{laptop} and \textit{product} are not synonyms in the strict sense of the word, even if they are interchangeable within the review texts. Therefore, we cannot use a pre-existing synonym dictionary to group the arguments.
However, since the terms are interchangeable within the review texts, we can once again utilise the context of the words to group words with similar contexts into synsets. In order to compare the contexts of words, we must obtain context-based representations for them. One such representation is called a \textit{word embedding}, which is a high-dimensional vector in a vector space where similar words are close to each other. We can obtain review-domain word embeddings by training a \textit{Word2Vec} model on the review texts. The Word2Vec model learns the word embeddings by attempting to predict each word in the text corpus from a window of surrounding words.
We use a relatively small window of 7 words, exemplified by the following two review sentences where the window is underlined for the terms \textit{laptop} and \textit{product}:
\begin{center}
\textit{I \underline{would recommend this \textbf{laptop} to my friends}, although the keyboard isn't perfect}
and
\textit{I \underline{would recommend this \textbf{product} to my friends}, as it is the best purchase I've ever made.}
\end{center}
The windows for \textit{laptop} and \textit{product} are identical, which means that their word embeddings will be similar. The small window ensures that the focus is on the interchangeability of the words, rather than their relatedness on a larger scale. As the above two sentences illustrate, the terms \textit{laptop} and \textit{product} might be used in slightly different contexts on a larger scale, but their meaning, which is expressed in the nearby text, stays the same. Furthermore, the small window size prevents sibling arguments from being grouped together based on their association with their parent argument, as exemplified in these two review texts:
\begin{center}
\textit{I like this lens because \underline{of the convenient \textbf{zoom} functionality which works} like a dream}
and
\textit{I like this lens because the \underline{quality of its \textbf{glass} takes such clear} pictures.}
\end{center}
Although both \textit{zoom} and \textit{glass} are mentioned in association with their parent argument \textit{lens}, their nearby contexts are very different.
Once we have obtained the word embeddings, we can use the \textit{relative cosine similarity} of the vectors to group them into synsets, as proposed by Leeuwenberg et al.\ \cite{RefWorks:doc:5eaebe76e4b098fe9e0217c2}, who showed that relative cosine similarity is a more accurate measure of synonymy than plain cosine similarity. The cosine similarity between word embeddings $w_i$ and $w_j$ relative to the top $n$ most similar words is calculated with the following formula:
$$rcs_n(w_i,w_j) = \frac{cosine\_similarity(w_i,w_j)}{\sum_{w_c \in TOP_n}cosine\_similarity(w_i,w_c)},$$
where $TOP_n$ is a set of the $n$ most similar words to $w_i$. In this paper, we use $n=10$. If $rcs_{10}(w_i,w_j) > 0.10$, then $w_j$ is more similar to $w_i$ than the average word in $TOP_{10}$, which was shown in \cite{RefWorks:doc:5eaebe76e4b098fe9e0217c2} to be a good indicator of synonymy.
Let arguments $a_1$ and $a_2$ be synonyms if $rcs_{10}(a_1,a_2) \geq 0.11$. Then we group the arguments $\mathcal{A}$ into synsets $\mathcal{S}$ where
$$\forall a_1,a_2 \in \mathcal{A}. \ \forall s \in \mathcal{S}. \ rcs_{10}(a_1,a_2)\geq0.11 \wedge a_1 \in s \implies a_2 \in s,$$
given that $$\forall a \in \mathcal{A}. \ \exists s \in \mathcal{S}. \ a \in s.$$
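The following gensim-based sketch illustrates the synonym extraction step, where \texttt{tokenised\_sentences} are the tokenised review sentences from the noun extraction step and \texttt{extracted\_arguments} are the arguments accepted in the feature extraction step; the vector size and minimum count are assumptions, while the window size of 7 and the $rcs_{10} \geq 0.11$ threshold follow the text.
\begin{lstlisting}
# Sketch of synonym extraction with Word2Vec and relative cosine similarity.
from gensim.models import Word2Vec

model = Word2Vec(sentences=tokenised_sentences, vector_size=100,
                 window=7, min_count=5, workers=4)

def rcs(wv, w_i, w_j, n=10):
    # relative cosine similarity of w_j with respect to w_i
    top_n = wv.most_similar(w_i, topn=n)          # [(word, cosine_sim), ...]
    return wv.similarity(w_i, w_j) / sum(sim for _, sim in top_n)

def group_synsets(wv, arguments, threshold=0.11):
    # start with a singleton synset per argument and merge any two synsets
    # that contain a pair of arguments with rcs_10 >= threshold
    synsets = [{a} for a in arguments]
    for i, a1 in enumerate(arguments):
        for a2 in arguments[i + 1:]:
            if rcs(wv, a1, a2, n=10) >= threshold:
                s1 = next(s for s in synsets if a1 in s)
                s2 = next(s for s in synsets if a2 in s)
                if s1 is not s2:
                    synsets.remove(s2)
                    s1 |= s2
    return synsets

synsets = group_synsets(model.wv, extracted_arguments)
\end{lstlisting}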
\subsection{Ontology extraction}
The synsets obtained in the previous step will form the nodes of the ontology tree. In this step, we extract the sub-feature relations that allow us to construct the shape of the tree. In order to do this, we obtain review sentences that mention a word from exactly two synsets, and pass the sentences through a BERT-based classifier to obtain votes on whether the two arguments are related and, if they are, which of the arguments is a feature of the other. In the end, we aggregate these votes for each pair of synsets to obtain a relatedness measure between the synsets, which we use to construct the ontology.
\subsubsection{BERT for relation extraction}
Figure \ref{fig:relationBERT} shows the architecture of the BERT-based classifier we use for relation extraction. The classifier takes as input a review sentence, as well as the two arguments $a_1$ and $a_2$ for which we wish to obtain one of three labels: 0 if $a_1$ and $a_2$ are not related, 1 if $a_2$ is a feature of $a_1$, and 2 if $a_1$ is a feature of $a_2$. The tokenisation step masks the tokens associated with the arguments (in the example of Figure \ref{fig:relationBERT}, 'laptop', 'operating', and 'system') with the \texttt{[MASK]} token. The tokens are passed through the transformer network, and the output used for classification is taken from the positions of the masked tokens of the two arguments. If an argument consists of several tokens, a max-pooling operation is performed on their outputs, such that we obtain a single vector for each of the two arguments. The two vectors are then concatenated and passed on to a linear classification layer with an output for each of the three labels. The linear layer is followed by a softmax operation, which outputs the probabilities $p_0$, $p_1$, and $p_2$ of the three labels.
\begin{figure}[h]
\centering
\includegraphics[width=12cm]{images/relation_bert.png}
\caption{BERT for relation extraction}
\label{fig:relationBERT}
\end{figure}
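A minimal PyTorch sketch of this architecture, analogous to the feature extraction classifier and with the same caveats about assumed libraries and naming, could look as follows.
\begin{lstlisting}
# Sketch of the relation classifier: the two arguments are masked and
# max-pooled separately, then their vectors are concatenated.
import torch
import torch.nn as nn
from transformers import BertModel

class RelationBert(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # concatenation of the two pooled vectors -> 3 labels
        self.classifier = nn.Linear(2 * self.bert.config.hidden_size, 3)

    def forward(self, input_ids, attention_mask, mask_a1, mask_a2):
        # mask_a1 / mask_a2: boolean tensors marking the [MASK]ed tokens
        # of arguments a1 and a2, respectively
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        pooled_a1, _ = hidden.masked_fill(
            ~mask_a1.unsqueeze(-1), float('-inf')).max(dim=1)
        pooled_a2, _ = hidden.masked_fill(
            ~mask_a2.unsqueeze(-1), float('-inf')).max(dim=1)
        logits = self.classifier(torch.cat([pooled_a1, pooled_a2], dim=-1))
        return logits  # softmax over the logits gives (p0, p1, p2)
\end{lstlisting}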
\subsubsection{Training the model}
We trained the model on the relation extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 16. We used the Adam optimiser with standard cross-entropy loss. The model was trained on an NVIDIA GeForce GTX 1080 GPU with 16GB RAM, and training took 2 hours and 5 minutes. The final accuracy and macro F1-score on the validation set were 0.834 and 0.820, respectively.
\subsubsection{Ontology construction from votes}
Let $N$ be the number of synsets and $V \in \mathbb{R}_{\geq 0}^{N \times N}$ be a matrix in which we accumulate the relation votes between each pair of synsets. $V$ is initialised with zeroes, and for each vote $(p_0,p_1,p_2)$ on arguments belonging to synsets $s_n$ and $s_m$, we increase the element $v_{m,n}$ of $V$ by $p_1$ and the element $v_{n,m}$ by $p_2$. In the end, element $v_{i,j}$ of $V$ contains the sum of the votes that $s_i$ is a feature of $s_j$.
Let $n_{i,j}$ be the total number of input sentences to the relation classifier with arguments from $s_i$ and $s_j$. Then
$$\bar{v}_{i,j} = \frac{v_{i,j}}{n_{i,j}}$$
is the mean vote for $s_i$ being a feature of $s_j$. However, this is not a reliable measure of relatedness on its own, as many unrelated arguments may appear together in only a few sentences, which is not enough data to guarantee an accurate estimate of their relatedness. Conversely, if $a_1$ is a feature of $a_2$, $a_1$ is likely to appear often in conjunction with $a_2$. We can use this observation to improve the accuracy of the relatedness measure.
Let $c_i$ be the total count for occurrences of an argument from $s_i$ in the review texts. Then
$$\tau_{i,j} = \frac{n_{i,j}}{c_i}$$
is a relative measure of how often an argument from $s_i$ appears in conjunction with an argument from $s_j$. If we scale $\bar{v}_{i,j}$ by $\tau_{i,j}$, we obtain a more accurate measure of relatedness,
$$r_{i,j} = \bar{v}_{i,j} \times \tau_{i,j} = \frac{v_{i,j}}{c_i}.$$
Using this formula, we define the \textit{relation matrix}
$$R = V \mathbin{/} \textbf{c},$$
where $\textbf{c}$ is a vector containing the counts $c_i$ for each synset $s_i$, and the division is performed row-wise.
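A numpy sketch of this computation could look as follows, assuming the classifier votes have been collected as tuples $(n, m, p_0, p_1, p_2)$ of the two synset indices and the output probabilities, and that \texttt{counts} holds the occurrence counts $c_i$.
\begin{lstlisting}
# Sketch of the relation matrix computation (vote format assumed).
import numpy as np

def relation_matrix(votes, counts):
    N = len(counts)
    V = np.zeros((N, N))
    for n, m, p0, p1, p2 in votes:
        V[m, n] += p1   # vote that s_m is a feature of s_n
        V[n, m] += p2   # vote that s_n is a feature of s_m
    # divide each row i by the occurrence count c_i: R[i, j] = v_ij / c_i
    return V / np.asarray(counts)[:, np.newaxis]
\end{lstlisting}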
We know that the product itself forms the root of the ontology tree, so we do not have to consider the product synset being a sub-feature of another synset. For each of the remaining synsets $s_i$, we calculate its super-feature $\hat{s}_i$ using row $r_i$ of the relation matrix, which contains the relatedness scores from $s_i$ to the other synsets. For example, the row corresponding to the synset of \textit{zoom} could be as follows:
\begin{center}
{\renewcommand{\arraystretch}{1.2}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
camera & lens & battery & screen & zoom & quality \\
\hline
0.120 & 0.144 & 0.021 & 0.041 & - & 0.037 \\
\hline
\end{tabular}
}
\end{center}
Clearly, \textit{zoom} appears to be a feature of \textit{lens}, as the relatedness score for \textit{lens} is higher than for any other feature. The relatedness score for the product \textit{camera} is also high, as is expected for any feature, since any descendant of the product in the ontology is considered its sub-feature, as defined in Section \ref{sec:annotation}. Based on experimentation, we define $\hat{s}_i = s_j$ where $j = \arg\max(r_i)$, although other heuristics could work here as well.
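Continuing the numpy sketch above, the super-feature indices could then be read off the relation matrix as follows; excluding the diagonal is an implementation detail we assume.
\begin{lstlisting}
# Selecting the super-feature of each synset from the relation matrix R;
# the product synset is excluded since it forms the root of the tree.
import numpy as np

def super_features(R, product_idx):
    R = R.copy()
    np.fill_diagonal(R, -np.inf)  # a synset cannot be its own super-feature
    return {i: int(np.argmax(R[i]))           # j = argmax(r_i)
            for i in range(R.shape[0])
            if i != product_idx}
\end{lstlisting}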
Using the super-feature relations, we build the ontology tree from the root down with the function shown in pseudocode in Figure \ref{fig:gettree}.
\begin{figure}[H]
\centering
\begin{tabular}{c}
\begin{lstlisting}
def get_tree(R, synsets, product):
    root = product  # the product synset forms the root of the tree
    synsets = [s for s in synsets if s is not product]
    # insert all direct children of the product first
    for s in [s for s in synsets if s.super == product]:
        add_child(root, s)
        synsets.remove(s)
    # insert remaining synsets in descending order of relatedness to their super-feature
    for s in sorted(synsets, key=lambda s: R[s][s.super], reverse=True):
        if descendant(root, s):
            continue  # s was already inserted as a missing super-feature
        if descendant(root, s.super):
            # super-feature of s is already in the tree
            if depth(s.super) < 2:
                add_child(s.super, s)
            else:
                # max depth would be exceeded, so add s as a sibling instead
                add_child(parent(s.super), s)
        else:
            # super-feature of s is not yet in the tree
            add_child(root, s.super)
            add_child(s.super, s)
    return root
\end{lstlisting}
\end{tabular}
\caption{Function for constructing the ontology tree}
\label{fig:gettree}
\end{figure}
\section{Evaluation}
We evaluate our ontology extraction method using human annotators, both independently and against ontologies extracted using ConceptNet and WordNet. We evaluate the ontologies extracted for a variety of five randomly selected products which were not included in the training data for the classifiers, including \textit{watches} and \textit{televisions}. The full ontologies extracted for these products are included in Appendix \ref{sec:ontology_appendix}.
Furthermore, we independently evaluate how well the masked BERT method generalises by experimenting with the number of product categories used for its training.
\subsection{Ontology evaluation}
\subsection{Generalisation evaluation}
]
[face
[hands]
[size]
[color]
]
[price]
[quality]
\usepackage[edges]{forest}
\usepackage{multirow}
\usepackage{listings}
\lstset{basicstyle=\ttfamily\footnotesize,breaklines=true}
\renewcommand{\figurename}{Listing}
\usepackage{float}
\usepackage{amsthm}
%% \DeclareMathSymbol{\Alpha}{\mathalpha}{operators}{"41}
%% \usepackage[]{algorithm2e}