Commit d8ea8871 authored by Joel Oksanen

Completed evaluation plan

parent 3fea086d
......@@ -481,6 +481,7 @@ Gunasekara et al.\ recently proposed the more general method of \textit{Quantise
Although implementing Quantised Dialog for the relatively simple domain of ADA would be excessive, we could combine some of its features with our existing semantic analysis methods developed for ADA's review aggregations. The user will ask semantically loaded questions about the product and its features, which should be covered by the same feature extraction and feature-dependent sentiment analysis methods as the review texts. If we can extract the semantics behind the queries and group them with one of the explanation requests of Definition \ref{def:argdialogue}, we can answer them with the predefined responses. However, this assumes that the user has sufficient information about the kind of responses the agent can produce.
\subsection{Evaluation using the RRIMS properties}
\label{sec:rrims}
Radlinski et al.\ propose that a conversational search system should satisfy the following five properties, termed the \textit{RRIMS properties}:
......
\chapter{Evaluation Plan}
Our extensions to ADA involve both quantitative and qualitative aspects, such as the accuracy of our sentiment analysis methods and the usability of our interface implementations, respectively. The evaluation is expected to take place in early June, and the plans for quantitative and qualitative assessment are detailed in the following sections.
\section{Quantitative assessment}
We will evaluate our sentiment analysis implementations both individually and together with our feature extraction implementation as part of ADA.
The two sentiment analysis implementations can be evaluated on their own using hand-labelled data sets, such as the freely available one from \cite{RefWorks:doc:5e2e107ce4b0bc4691206e2e} for target-dependent Twitter sentiment classification. The results can be compared with each other and with the baseline results reported in existing target-dependent sentiment analysis papers, discussed in section \ref{sec:sa}. We could also label our own dataset of Amazon reviews, which would additionally allow us to test the feature extraction. Doing so, however, requires us to decide what constitutes a feature of a product.
We will also evaluate our sentiment analysis and feature extraction implementations as part of ADA. This can be done by comparing the dialectical strength measure for each product with the product's aggregated user rating, and calculating the Pearson correlation coefficient (PCC) between the two across a set of products. The intuition is that a closer correspondence between these two figures implies that the agent's semantic understanding is more accurate. PCC scores for particular Amazon product domains can be compared with the PCC score for a wide domain of products, in order to determine the generality of our method. We can evaluate the performance of the ADA extensions by comparing our PCC score to the one achieved in \cite{RefWorks:doc:5e08939de4b0912a82c3d46c} for Rotten Tomatoes: a comparable score would constitute success, as the domain is more challenging. Furthermore, we could evaluate our extended ADA on the same Rotten Tomatoes review dataset, in order to see how well it generalises to different settings.
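For reference, the PCC used in this comparison follows the standard definition. The notation here is our own assumption and is not taken from \cite{RefWorks:doc:5e08939de4b0912a82c3d46c}: for a set of $n$ products, let $s_i$ denote the dialectical strength computed for product $i$ and $u_i$ its aggregated user rating, with sample means $\bar{s}$ and $\bar{u}$. Then
\begin{equation}
r = \frac{\sum_{i=1}^{n} (s_i - \bar{s})(u_i - \bar{u})}{\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^2}\,\sqrt{\sum_{i=1}^{n} (u_i - \bar{u})^2}},
\end{equation}
where $r \in [-1, 1]$ and values closer to $1$ indicate a stronger positive linear relationship between the two measures. Since $r$ is invariant under positive linear rescaling of either variable, the dialectical strength and the user rating do not need to be brought onto a common scale before the comparison.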
The results from the individual evaluation of the feature-dependent sentiment analysis provide information on how well ADA can distinguish between the different aspects of a product. This contrasts with the overall evaluation of ADA, which only tells us whether it understands the general sentiment towards a product. Quantitative evaluation of ADA's feature-level understanding was not performed in \cite{RefWorks:doc:5e08939de4b0912a82c3d46c}, so it will be interesting to evaluate.
\section{Qualitative assessment}
We will evaluate the performance of our user interface implementations qualitatively via user feedback, guided by the RRIMS properties defined in section \ref{sec:rrims}. Having two implementations to compare will help us determine their respective strengths and weaknesses.