\chapter{Ontology extraction}

\section{Exploration}

In this paper, we will limit the extraction of features to unigram, bigram, and trigram nouns, as the vast majority of terms for products and features fall into these categories. Although not strictly necessary, this will greatly help limit our search for features within the review texts with little effect on the recall of our model.

%\subsection{ConceptNet}
%
%\subsection{Hand-crafted features}
%
%\subsection{Masked BERT for unsupervised ontology extraction}

\section{Implementation}

Our method of ontology extraction is a multi-step pipeline using both hand-crafted grammatical features and two BERT-based models trained using \textit{distantly supervised learning} to extract the product ontology from review texts for the product. The first model is used for \textit{named-entity recognition} (NER) of the various features of the product, while the second model is used for \textit{relation extraction} (RE) in order to extract sub-feature relations between the recognised features. In addition, we use a \textit{Word2Vec} \cite{RefWorks:doc:5edbafdde4b0482c79eb8d95} model to extract word vectors, which are used to group the features into sets of synonyms, or \textit{synsets}, using a method proposed by Leeuwenberg et al.\ \cite{RefWorks:doc:5eaebe76e4b098fe9e0217c2}. The pipeline is structured as follows: 

\begin{enumerate}
	\item Noun extraction
	\item Feature extraction
	\item Synonym extraction
	\item Ontology extraction.
\end{enumerate}

In this section, we will first describe the annotation method used to obtain the training data for the two BERT-based models, after which we will go through each of the pipeline steps in detail.

\subsection{Annotation of training data for masked BERT}
\label{sec:annotation}

Annotating training data that would be representative of the whole set of Amazon products would be nearly impossible due to the sheer number of different product categories on Amazon. However, in review texts, certain grammatical constructs stay the same regardless of the product. Take for example the two review sentences:
\begin{center}
\textit{I love the \textbf{lens} on this \textbf{camera}} \quad and \quad \textit{I love the \textbf{material} of this \textbf{sweater}.}
\end{center}
Clearly, \textit{lens} is a feature of \textit{camera} and \textit{material} is a feature of \textit{sweater}. If we mask the entities mentioned in the two sentences, we obtain:
\begin{center}
\textit{I love the \textbf{e1} on this \textbf{e2}} \quad and \quad \textit{I love the \textbf{e1} of this \textbf{e2},}
\end{center}
which are nearly identical. So while the entities in review texts are domain-specific, the context is often largely domain-independent. Therefore, using the masked sentences, it would be possible to train a classifier to recognise that \textit{e1} and \textit{e2} are both entities, and that \textit{e1} is a sub-feature of \textit{e2}.

In fact, BERT has inbuilt support for masking, as it plays a central role in its pre-training phase. One of the two tasks on which BERT is pre-trained is \textit{masked language modelling}, in which randomly chosen words in a large corpus of text are replaced with a \texttt{[MASK]} token and the model is asked to predict them.

Of course, the above example is an idealised scenario. If we mask the entities in the following review sentences:
\begin{center}
\textit{The \textbf{camera housing} is made of shiny black \textbf{plastic} but it feels nicely weighted and solid}

 and
 
 \textit{It's a lovely warm \textbf{jumper} that has a nice \textbf{feel} to it,}
\end{center}
we obtain the masked sentences:
\begin{center}
\textit{The \textbf{e1} is made of shiny black \textbf{e2} but it feels nicely weighted and solid}

and

\textit{It's a lovely warm \textbf{e1} that has a nice \textbf{e2} to it,}
\end{center}
which still contain some rather domain-specific terms such as \textit{shiny}, \textit{weighted}, and \textit{warm}. However, such words are rarely specific to a single product; they are often common to wider categories of products, such as electronics or clothing. Therefore, there is no need to annotate each and every product to represent the whole set of Amazon products: a small and varied set of products should be enough to train a domain-independent classifier.

Even annotating texts for just a few products would still require the annotation of at least thousands of texts in order to obtain a sufficiently large dataset. However, we can reduce the amount of work substantially by taking advantage of \textit{distantly supervised learning}, where rather than annotating each and every sentence by hand, we automatically annotate a large amount of text using predetermined heuristics. In our case, the heuristic will be a manually labelled ontology for the product: using this ontology, an automated process can search review texts for terms that appear in the ontology and label as well as mask them correctly. This will allow us to easily annotate data for multiple products, as we will only have to annotate their ontologies, which usually consist of no more than a hundred terms.

Notice that distant supervision is made possible here by the masking of the terms in the ontologies. We are annotating a much smaller set of terms than the size of the resulting training set, which means that the training set will be highly saturated with the annotated terms. If the words were not masked, the classifier would therefore simply learn to recognise the terms in the ontology and ignore their context entirely. Because the terms are masked, the classifier is forced to rely solely on their context, which is far more varied.

\begin{figure}[h]
	\centering
	\includegraphics[width=12cm]{images/entity_annotator.png}
	\caption{Entity annotator interface}
	\label{fig:entity_annotator}
\end{figure}

A program was written to ease the annotation of the ontologies, the interface of which is shown in Figure \ref{fig:entity_annotator} for sweater reviews. The program takes as input review sentences for a given product category such as sweaters, and counts the number of times each noun occurs in the sentences using the same method as the ontology extractor, which is detailed in Section \ref{sec:noun_extraction}. It then displays the nouns one by one to the annotator in descending order of count, and for each noun, the annotator will label it as either the root of the ontology tree (the product), a sub-feature or a synonym of an existing node in the tree, or \textit{nan} for nouns that are not an argument (for example \textit{daughter} in the review text \textit{my daughter loves it}). In order to simplify the annotation process, a noun with a lower count can only be annotated as a sub-feature of a noun with a higher count, as the reverse is rarely true. The annotation of \textit{nan} nouns is important as it allows us to obtain negative samples for the training data.

Once the annotator has labelled a sufficient number of nouns, the program can use the annotated ontology to label the entire set of review sentences to be used as training data for either the feature or relation extraction models. 

For the feature extraction training data, the program will select sentences with exactly one of the annotated nouns and label each sentence with a binary label indicating whether the noun is an argument or not. Although selecting only a subset of the sentences will limit the amount of training data, it allows us to reduce the NER task to a binary classification problem instead of a sequence-labelling problem. Furthermore, even relatively small product categories have large amounts of review data available, so we can afford to prune some of it in order to improve the accuracy of our model. Table \ref{tab:fe_training_data} shows a positive and a negative example.

\begin{table}[h]
	\centering
	{\renewcommand{\arraystretch}{1.2}
 	\begin{tabular}{|c|c|c|}
 	\hline
 	\texttt{text} & \texttt{noun} & \texttt{is\textunderscore argument} \\
 	\hline \hline
 	"I love this sweater." & "sweater" & 1 \\ 
 	\hline
 	"My daughter loves it!" & "daughter" & 0 \\
 	\hline
	\end{tabular}
	}
	\caption{Example training data for feature extraction}
	\label{tab:fe_training_data}
\end{table}
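To make this distant labelling concrete, the following sketch illustrates how such training examples could be generated. It is illustrative only: the \texttt{annotated} mapping from each annotated noun to whether it is an argument is a hypothetical name, and multi-word nouns and tokenisation details are omitted for brevity.

\begin{lstlisting}
# Illustrative sketch only: `annotated` is a hypothetical dict mapping each
# annotated noun to True (argument) or False (nan); multi-word nouns and
# tokenisation details are omitted for brevity.
def feature_extraction_examples(sentences, annotated):
    examples = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        hits = [t for t in tokens if t in annotated]
        if len(hits) != 1:  # keep only sentences with exactly one annotated noun
            continue
        noun = hits[0]
        examples.append((sentence, noun, int(annotated[noun])))
    return examples
\end{lstlisting}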

For the relation extraction training data, the program will select sentences with exactly two of the annotated nouns, mask both nouns, and label the sentence with one of three labels: 0 for no relation between the masked nouns, 1 if the second masked noun is a feature of the first masked noun, and 2 if the first masked noun is a feature of the second masked noun. A noun $n_1$ is considered a feature of noun $n_2$ iff $n_1$ is a descendant of $n_2$ in the annotated ontology tree. For example, \textit{fabric} is considered a feature of both \textit{sweater} and \textit{material} based on the ontology tree in Figure \ref{fig:entity_annotator}. Table \ref{tab:re_training_data} shows examples for each of the labels.

\begin{table}[h]
	\centering
	{\renewcommand{\arraystretch}{1.2}
 	\begin{tabular}{|c|c|c|c|}
 	\hline
 	\texttt{text} & \texttt{noun\textunderscore 1} & \texttt{noun\textunderscore 2} & \texttt{label} \\
 	\hline \hline
 	"I like the colour and the material." & "colour" & "material" & 0 \\ 
 	\hline
 	"My daughter loves this sweater." & "daughter" & "sweater" & 0 \\ 
 	\hline
 	"The sweater's fabric is so soft." & "sweater" & "fabric" & 1 \\
 	\hline
 	"The colour of the sweater is beautiful." & "colour" & "sweater" & 2 \\
 	\hline
	\end{tabular}
	}
	\caption{Example training data for relation extraction}
	\label{tab:re_training_data}
\end{table}
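The labelling rule itself reduces to a descendant check on the annotated ontology tree, as sketched below. The \texttt{descendants} helper is a hypothetical name for a function returning the set of nouns below a given node in the tree (empty for nouns not in the tree, such as \textit{nan} nouns).

\begin{lstlisting}
# Illustrative sketch: `descendants(tree, noun)` is a hypothetical helper
# returning the set of nouns below `noun` in the annotated ontology tree
# (empty if the noun is not in the tree, e.g. for nan nouns).
def relation_label(noun_1, noun_2, tree):
    if noun_1 in descendants(tree, noun_2):
        return 2  # noun_1 is a feature of noun_2
    if noun_2 in descendants(tree, noun_1):
        return 1  # noun_2 is a feature of noun_1
    return 0      # no relation (e.g. siblings or nan nouns)
\end{lstlisting}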

We used the program to obtain training data for five randomly selected products: digital cameras, backpacks, laptops, guitars, and cardigans. The categorised review data was obtained from a public repository\footnote{https://nijianmo.github.io/amazon/index.html} by Ni et al.\ \cite{RefWorks:doc:5edc9ecbe4b03b813c4d4381}. For each of these products, we annotated the 200 most common nouns, as we observed that most of the relevant features of a product are included within this range. After resampling to balance the number of instances for each of the classes, we obtained the training data shown in Table \ref{tab:training_data}.

\begin{table}[h]
	\centering
	{\renewcommand{\arraystretch}{1.2}
 	\begin{tabular}{|c||c|c|}
 	\hline
 	& \multicolumn{2}{c|}{Number of training instances} \\
 	\cline{2-3}
 	& per product & total \\
 	\hline \hline
 	Feature extraction & 56,526 & 282,630 \\ 
 	\hline
 	Relation extraction & 25,110 & 125,550 \\ 
 	\hline
	\end{tabular}
	}
	\caption{Training data counts for ontology extraction}
	\label{tab:training_data}
\end{table}

\subsection{Noun extraction}
\label{sec:noun_extraction}

The first step of our ontology extraction method is to extract the most commonly appearing nouns in the review texts, which will be candidates for features in the following step.

The review data is divided into review texts, many of which are multiple sentences long, so we first split the texts into sentences. In this paper, we treat each sentence as an individual unit of information, independent of other sentences in the same review text. We then tokenise the sentences and use an out-of-the-box implementation of a method by Mikolov et al.\ \cite{RefWorks:doc:5edca760e4b0ef3565a5f38d} to join common co-occurrences of tokens into bigrams and trigrams. This step is crucial for detecting multi-word nouns such as \textit{operating system}, which is an important feature of \textit{computer}. After this, we use a part-of-speech tagger to select the nouns among the tokens and count the number of occurrences of each noun. Finally, as in the annotation method detailed in Section \ref{sec:annotation}, we select the 200 most common nouns and pass them on to the feature extraction step.
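As a rough sketch, this step could be implemented along the following lines. The choice of NLTK for tokenisation and part-of-speech tagging, gensim's \texttt{Phrases} for the Mikolov-style phrase detection, the head-word heuristic for tagging multi-word tokens, and all parameter values are illustrative assumptions rather than an exact description of our implementation.

\begin{lstlisting}
from collections import Counter
import nltk  # requires the punkt and averaged_perceptron_tagger data
from gensim.models.phrases import Phrases, Phraser

def extract_common_nouns(review_texts, top_n=200):
    # split review texts into sentences and tokenise them
    sentences = [nltk.word_tokenize(s.lower())
                 for text in review_texts
                 for s in nltk.sent_tokenize(text)]

    # join common co-occurrences into bigrams, then trigrams (Mikolov et al.)
    bigrams = Phraser(Phrases(sentences, min_count=5, threshold=10))
    trigrams = Phraser(Phrases(bigrams[sentences], min_count=5, threshold=10))
    sentences = [trigrams[bigrams[s]] for s in sentences]

    # count tokens whose head word is tagged as a noun
    counts = Counter()
    for tokens in sentences:
        heads = [t.split("_")[-1] for t in tokens]  # head word of each token
        for token, (_, tag) in zip(tokens, nltk.pos_tag(heads)):
            if tag.startswith("NN"):
                counts[token.replace("_", " ")] += 1

    return [noun for noun, _ in counts.most_common(top_n)]
\end{lstlisting}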

\subsection{Feature extraction}

For the feature extraction step, we obtain review sentences that mention exactly one of the nouns obtained in the previous step, and pass the sentences through a BERT-based classifier to obtain votes for whether the noun is an argument or not. In the end, we aggregate these votes for each of the nouns to obtain a list of extracted arguments. 

\subsubsection{BERT for feature extraction}

Figure \ref{fig:entityBERT} shows the architecture of the BERT-based classifier used for feature extraction. The classifier takes as input a review sentence, as well as the noun which we wish to classify as an argument or a non-argument. The tokenisation step masks the tokens associated with the noun ('operating' and 'system') with the \texttt{[MASK]} token. The tokens are passed through the transformer network, and the output used for classification is taken from the positions of the masked tokens. The input to the linear classification layer is always of the dimension of a single BERT hidden layer output, so if there are several masked tokens, a max-pooling operation is performed on their outputs. The linear layer is followed by a softmax operation, which outputs the probabilities $p_0$ and $p_1$ of the masked noun being a non-argument or an argument, respectively.
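A minimal sketch of this architecture is shown below, assuming the HuggingFace \texttt{transformers} and PyTorch libraries as one possible realisation; the exact model configuration, the handling of variable numbers of masked positions across a batch, and the training loop are omitted.

\begin{lstlisting}
import torch
import torch.nn as nn
from transformers import BertModel  # assumed library choice

class FeatureClassifier(nn.Module):
    # BERT outputs at the [MASK] positions are max-pooled and mapped to
    # two classes: p_0 (non-argument) and p_1 (argument).
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.linear = nn.Linear(self.bert.config.hidden_size, 2)

    def forward(self, input_ids, attention_mask, mask_positions):
        # hidden: (batch, seq_len, hidden_size)
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
        masked = hidden[batch_idx, mask_positions]  # (batch, n_masks, hidden_size)
        pooled, _ = masked.max(dim=1)               # max-pool over masked tokens
        return torch.softmax(self.linear(pooled), dim=-1)  # (p_0, p_1)
\end{lstlisting}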

For each of the nouns, we take the mean of its $p_1$ votes, and accept it as an argument if the mean is above 0.65, a hyperparameter tuned through validation to strike a good balance between precision and recall of the feature extraction. Using the raw output probabilities from the network rather than binary votes allows us to bias the aggregate towards more certain predictions of the model.
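The aggregation itself is then a thresholded mean over each noun's votes, sketched below; the \texttt{votes} mapping from nouns to their lists of $p_1$ outputs is a hypothetical name for illustration.

\begin{lstlisting}
import numpy as np

# `votes` is a hypothetical dict mapping each candidate noun to the list of
# p_1 probabilities output by the classifier for its sentences.
def extract_arguments(votes, threshold=0.65):
    return [noun for noun, p1s in votes.items() if np.mean(p1s) > threshold]
\end{lstlisting}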

\begin{figure}[h]
	\centering
	\includegraphics[width=12cm]{images/entity_bert.png}
	\caption{BERT for feature extraction}
	\label{fig:entityBERT}
\end{figure}

\subsubsection{Training the model}

We trained the model on the feature extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 32. We used the Adam optimiser with standard cross-entropy loss. The model was trained on an NVIDIA GeForce GTX 1080 GPU with 16GB RAM and took 3 hours and 16 minutes. The final accuracy and macro F1-score on the validation set were 0.897 and 0.894, respectively.

\subsection{Synonym extraction}

Reviewers can refer to the same argument using many different terms; for example, the argument \textit{laptop} can be referred to with the terms \textit{computer}, \textit{device}, and \textit{product}. In order to construct an ontology tree, we must be able to group all of these terms under the same node. However, terms like \textit{laptop} and \textit{product} are not synonyms in the strict sense of the word, even if they are interchangeable within the review texts. Therefore, we cannot use a pre-existing synonym dictionary to group the arguments.

However, since the terms are interchangeable within the review texts, we can once again utilise the context of the words to group words with similar contexts into synsets. In order to compare the contexts of words, we must obtain context-based representations for them. One such representation is called a \textit{word embedding}, which is a high-dimensional vector in a vector space where similar words are close to each other. We can obtain review-domain word embeddings by training a \textit{Word2Vec} model on the review texts. The Word2Vec model learns the word embeddings by attempting to predict each word in the text corpus from a window of surrounding words. 

We use a relatively small window of 7 words, exemplified by the following two review sentences where the window is underlined for the terms \textit{laptop} and \textit{product}:
\begin{center}
\textit{I \underline{would recommend this \textbf{laptop} to my friends}, although the keyboard isn't perfect}

 and
 
 \textit{I \underline{would recommend this \textbf{product} to my friends}, as it is the best purchase I've ever made.}
\end{center}
The windows for \textit{laptop} and \textit{product} are identical, which means that their word embeddings will be similar. The small window ensures that the focus is on the interchangeability of the words, rather than their relatedness on a larger scale. As the above two sentences illustrate, the terms \textit{laptop} and \textit{product} might be used in slightly different contexts on a larger scale, but their meaning, which is expressed in the nearby text, stays the same. Furthermore, the small window size prevents sibling arguments from being grouped together based on their association with their parent argument, as exemplified in these two review texts:
\begin{center}
\textit{I like this lens because \underline{of the convenient \textbf{zoom} functionality which works} like a dream}

 and
 
 \textit{I like this lens because the \underline{quality of its \textbf{glass} takes such clear} pictures.}
\end{center}
Although both \textit{zoom} and \textit{glass} are mentioned in association with their parent argument \textit{lens}, their nearby contexts are very different.
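As one possible realisation, the embeddings could be obtained with gensim's \texttt{Word2Vec} along the following lines; the hyperparameters other than the window are illustrative, and the tokenised review sentences are assumed to come from the noun extraction step.

\begin{lstlisting}
from gensim.models import Word2Vec

# `sentences` is assumed to be the list of tokenised review sentences from
# the noun extraction step; window=3 gives three context words on each side,
# i.e. the 7-word window described above.
w2v = Word2Vec(sentences, vector_size=300, window=3, min_count=5, workers=4)
word_vectors = w2v.wv  # word embeddings used for synonym grouping
\end{lstlisting}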

Once we have obtained the word embeddings, we can use the \textit{relative cosine similarity} of the vectors to group them into synsets, as proposed by Leeuwenberg et al.\ \cite{RefWorks:doc:5eaebe76e4b098fe9e0217c2}, who showed that relative cosine similarity is a more accurate measure of synonymy than plain cosine similarity. The cosine similarity between word embeddings $w_i$ and $w_j$ relative to the top $n$ most similar words is calculated with the following formula:
$$rcs_n(w_i,w_j) = \frac{cosine\_similarity(w_i,w_j)}{\sum_{w_c \in TOP_n}cosine\_similarity(w_i,w_c)},$$
where $TOP_n$ is a set of the $n$ most similar words to $w_i$. In this paper, we use $n=10$. If $rcs_{10}(w_i,w_j) > 0.10$, $w_i$ is more similar to $w_j$ than an arbitrary similar word from $TOP_{10}$, which was shown in \cite{RefWorks:doc:5eaebe76e4b098fe9e0217c2} to be a good indicator of synonymy.

Let arguments $a_1$ and $a_2$ be synonyms if $rcs_{10}(a_1,a_2) \geq 0.11$. Then we group the arguments $\mathcal{A}$ into synsets $\mathcal{S}$ where
 $$\forall a_1,a_2 \in \mathcal{A}. \ \forall s \in \mathcal{S}. \ rcs_{10}(a_1,a_2)\geq0.11 \wedge a_1 \in s \implies a_2 \in s,$$
given that $$\forall a \in \mathcal{A}. \ \exists s \in \mathcal{S}. \ a \in s.$$
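A sketch of the relative cosine similarity computation and a greedy grouping that realises one possible reading of this definition (merging any synsets linked, possibly transitively, by the synonymy relation) is given below, assuming gensim \texttt{KeyedVectors} as produced above.

\begin{lstlisting}
def rcs(wv, w_i, w_j, n=10):
    # relative cosine similarity as defined above, using gensim KeyedVectors
    top_n = wv.most_similar(w_i, topn=n)  # [(word, cosine similarity), ...]
    return wv.similarity(w_i, w_j) / sum(sim for _, sim in top_n)

def group_synsets(wv, arguments, threshold=0.11, n=10):
    # greedily merge arguments into synsets: two arguments share a synset
    # if they are linked (possibly transitively) by rcs >= threshold
    synsets = []
    for a in arguments:
        linked = [s for s in synsets
                  if any(rcs(wv, a, b, n) >= threshold or
                         rcs(wv, b, a, n) >= threshold for b in s)]
        merged = {a}.union(*linked)
        synsets = [s for s in synsets if s not in linked] + [merged]
    return synsets
\end{lstlisting}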

\subsection{Ontology extraction}

The synsets obtained in the previous step will form the nodes of the ontology tree. In this step, we will extract the sub-feature relations that will allow us to construct the shape of the tree. In order to do this, we obtain review sentences that mention a word from exactly two synsets, and pass the sentences through a BERT-based classifier to obtain votes for whether the arguments are related and, if they are, which of the arguments is a feature of the other. In the end, we aggregate these votes within each of the synsets to obtain a relatedness measure between each pair of synsets, which we use to construct the ontology.

\subsubsection{BERT for relation extraction}

Figure \ref{fig:relationBERT} shows the architecture of the BERT-based classifier we use for relation extraction. The classifier takes as input a review sentence, as well as the two arguments $a_1$ and $a_2$ for which we wish to obtain one of three labels: 0 if $a_1$ and $a_2$ are not related, 1 if $a_2$ is a feature of $a_1$, and 2 if $a_1$ is a feature of $a_2$. The tokenisation step masks the tokens associated with the arguments ('laptop', 'operating', and 'system') with the \texttt{[MASK]} token. The tokens are passed through the transformer network, and the output used for classification is taken from the positions of the masked tokens for the two arguments. If an argument consists of several tokens, a max-pooling operation is performed on their outputs such that we obtain a single vector for each argument. The two vectors are then concatenated and passed to a linear classification layer with an output for each of the three labels. The linear layer is followed by a softmax operation, which outputs the probabilities $p_0$, $p_1$, and $p_2$ of the three labels.
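A sketch analogous to the feature extraction classifier is shown below, again assuming the HuggingFace \texttt{transformers} library as one possible realisation; the per-argument mask positions are assumed to be provided by the tokenisation step, and padding details are omitted.

\begin{lstlisting}
import torch
import torch.nn as nn
from transformers import BertModel  # assumed library choice

class RelationClassifier(nn.Module):
    # The BERT outputs at each argument's [MASK] positions are max-pooled,
    # the two pooled vectors are concatenated, and a linear layer outputs
    # the probabilities (p_0, p_1, p_2) of the three labels.
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.linear = nn.Linear(2 * self.bert.config.hidden_size, 3)

    def forward(self, input_ids, attention_mask, a1_positions, a2_positions):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        batch_idx = torch.arange(hidden.size(0)).unsqueeze(-1)
        pooled_1, _ = hidden[batch_idx, a1_positions].max(dim=1)
        pooled_2, _ = hidden[batch_idx, a2_positions].max(dim=1)
        combined = torch.cat([pooled_1, pooled_2], dim=-1)
        return torch.softmax(self.linear(combined), dim=-1)  # (p_0, p_1, p_2)
\end{lstlisting}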

\begin{figure}[h]
	\centering
	\includegraphics[width=12cm]{images/relation_bert.png}
	\caption{BERT for relation extraction}
	\label{fig:relationBERT}
\end{figure}

\subsubsection{Training the model}

We trained the model on the relation extraction data shown in Table \ref{tab:training_data}, setting aside 5\% of the data for validation. The final model was trained for 3 epochs with a batch size of 16. We used the Adam optimiser with standard cross-entropy loss. The model was trained on an NVIDIA GeForce GTX 1080 GPU with 16GB RAM and took 2 hours and 5 minutes. The final accuracy and macro F1-score on the validation set were 0.834 and 0.820, respectively.

\subsubsection{Ontology construction from votes}

Let $N$ be the number of synsets and $V \in \mathbb{R}^{N \times N}$ be a matrix in which we accumulate the relation votes between the synsets. $V$ is initialised with zeroes, and for each vote $(p_0,p_1,p_2)$ on a sentence whose first argument belongs to synset $s_n$ and whose second argument belongs to synset $s_m$, we add $p_1$ to the element $v_{m,n}$ of $V$ and $p_2$ to the element $v_{n,m}$. In the end, element $v_{i,j}$ of $V$ contains the sum of the votes that $s_i$ is a feature of $s_j$.

Let $n_{i,j}$ be the total number of input sentences to the relation classifier with arguments from $s_i$ and $s_j$. Then 
$$\bar{v}_{i,j} = \frac{v_{i,j}}{n_{i,j}}$$ 
is the mean vote for $s_i$ being a feature of $s_j$. However, this is not a reliable measure of relatedness on its own, as many unrelated arguments may only appear together in a handful of sentences, which is not enough data to give an accurate picture of their relatedness. Conversely, if $a_1$ is a feature of $a_2$, then $a_1$ is likely to appear often in conjunction with $a_2$. We can use this observation to improve the accuracy of the relatedness measure.

Let $c_i$ be the total count for occurrences of an argument from $s_i$ in the review texts. Then 
$$\tau_{i,j} = \frac{n_{i,j}}{c_i}$$
is a relative measure of how often an argument from $s_i$ appears in conjunction with an argument from $s_j$. If we scale $\bar{v}_{i,j}$ by $\tau_{i,j}$, we obtain a more accurate measure of relatedness,
$$r_{i,j} = \bar{v}_{i,j} \times \tau_{i,j} = \frac{v_{i,j}}{n_{i,j}} \times \frac{n_{i,j}}{c_i} = \frac{v_{i,j}}{c_i}.$$
Using this formula, we define the \textit{relation matrix}
$$R = V \mathbin{/} \textbf{c},$$
where $\textbf{c}$ is the vector containing the counts $c_i$ for each synset $s_i$, and the division is performed row-wise, i.e.\ row $i$ of $V$ is divided by $c_i$.
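The accumulation of the votes into $R$ can be sketched as follows, assuming each classified sentence yields a tuple of its two synset indices and the classifier's output probabilities.

\begin{lstlisting}
import numpy as np

# `votes` is assumed to yield (n, m, p0, p1, p2) for each classified sentence
# whose first argument is from synset s_n and second argument from synset s_m;
# `c` contains the occurrence counts c_i of the synsets.
def relation_matrix(votes, c, N):
    V = np.zeros((N, N))
    for n, m, p0, p1, p2 in votes:
        V[m, n] += p1  # vote that s_m is a feature of s_n
        V[n, m] += p2  # vote that s_n is a feature of s_m
    return V / np.asarray(c)[:, None]  # r_ij = v_ij / c_i (row-wise division)
\end{lstlisting}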

We know that the product itself forms the root of the ontology tree, so we do not have to consider the product synset being a sub-feature of another synset. For each of the remaining synsets $s_i$, we calculate its super-feature $\hat{s}_i$ using row $r_i$ of the relation matrix, which contains the relatedness scores from $s_i$ to the other synsets. For example, the row corresponding to the synset of \textit{zoom} could be as follows:
\begin{center}
  {\renewcommand{\arraystretch}{1.2}
 	\begin{tabular}{|c|c|c|c|c|c|}
 	\hline
 	camera & lens & battery & screen & zoom & quality \\
 	\hline
 	0.120 & 0.144 & 0.021 & 0.041 & - & 0.037 \\ 
 	\hline
	\end{tabular}
  }
\end{center}
Clearly, \textit{zoom} appears to be a feature of \textit{lens}, as the relatedness score for \textit{lens} is higher than for any other feature. The relatedness score for the product \textit{camera} is also high, as is expected for any feature, since any descendant of the product in the ontology is considered its sub-feature, as defined in Section \ref{sec:annotation}. Based on experimentation, we define $\hat{s}_i=s_j$ where $j = \arg\max(r_i)$, although other heuristics could work here as well.
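Continuing the sketch above, the super-feature selection then amounts to a row-wise argmax over $R$, skipping the product synset and the diagonal.

\begin{lstlisting}
import numpy as np

def super_features(R, product):
    # pick each non-product synset's super-feature as the argmax of its row,
    # excluding the synset itself
    scores = R.copy()
    np.fill_diagonal(scores, -np.inf)
    return {i: int(np.argmax(scores[i]))
            for i in range(R.shape[0]) if i != product}
\end{lstlisting}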

Using the super-feature relations, we build the ontology tree from the root down with the function shown in pseudocode in Figure \ref{fig:gettree}.

\begin{figure}[H]
\centering

\begin{tabular}{c}
\begin{lstlisting}
def get_tree(R, synsets):
    root = synsets.pop(product)  # set the product synset as the root

    # insert all direct children of the product
    for s in [s for s in synsets if s.super == product]:
        add_child(root, synsets.pop(s))

    # insert the remaining synsets in descending order of relatedness
    # to their super-feature
    for s in sorted(synsets, key=lambda s: R[s][s.super], reverse=True):
        if descendant(root, s.super):
            # super-feature of s is already in the tree
            if depth(s.super) < 2:
                add_child(s.super, s)
            else:
                # max depth would be exceeded, so add s as a sibling instead
                add_child(parent(s.super), s)
        else:
            # super-feature of s is not yet in the tree
            add_child(root, synsets.pop(s.super))
            add_child(s.super, s)

    return root
\end{lstlisting}
\end{tabular}
\caption{Function for constructing the ontology tree}
\label{fig:gettree}
\end{figure}

\section{Evaluation}

In this section, we use human annotators to evaluate our ontology extraction method, both on its own and against ontologies extracted using ConceptNet and WordNet. Furthermore, we separately evaluate how well the masked BERT method generalises by experimenting with the number of product categories used for its training.

\subsection{Ontology evaluation}

We evaluate five ontologies extracted for a variety of randomly selected products which were not included in the training data for the classifier: \textit{watches}, \textit{televisions}, \textit{necklaces}, \textit{stand mixers}, and \textit{video games}. For each product, we use 100,000 review texts as input to the ontology extractor, except for \textit{stand mixer}, for which we could only obtain 28,768 review texts. The full ontologies extracted for each of the products are included in Appendix \ref{sec:ontology_appendix}.

We also extract ontologies for the five products from ConceptNet and WordNet for comparison. For ConceptNet, we observe 

\subsection{Generalisation evaluation}