A Critical Evaluation of ArguMessage

Objective

The research questions addressed are:

1. how easy is it to produce messages using ArguMessage?

2. how satisfied are participants with the messages generated?

Work

The process of creating and confirming the validity of persuasive messages is a cumbersome and time-consuming task, particularly given the lack of domain-independent tools for the purpose of message generation. This work describes an investigation into the effectiveness of ArguMessage, a system that uses argumentation schemes and limited user input to semi-automatically generate persuasive messages encouraging behaviour change that follow specific argumentation patterns. We conducted user studies in the domains of healthy eating and email security to investigate its effectiveness.

Study 1: We used ArguMessage to generate corpora of healthy eating messages.

Participants. We conducted a user study using ArguMessage with laypeople recruited via Amazon Mechanical Turk who had an acceptance rating of at least 90% and were located in the United States. This yielded 72 participants, of whom 31 were male (5 aged 18-25, 19 aged 26-40, 6 aged 41-65, and 1 aged over 65) and 41 were female (2 aged 18-25, 24 aged 26-40, 13 aged 41-65, and 2 aged over 65). Participants generated a total of 216 messages.

Procedure. Participants were first given instructions explaining what they were required to do, namely, generate three persuasive messages using three “recipes” (argumentation schemes). They were then asked to answer some questions to help ArguMessage generate the messages. Next, the description of a “recipe” was shown (including an example of the message it generates) along with a set of questions that the participant needed to answer to generate a message. Once the participant was happy with their answers, ArguMessage used template-based natural language generation to create a message and present it to the participant. An illustration of the completed participant input is shown above. Finally, participants indicated their satisfaction level with the message generated on a 5-point Likert scale and provided feedback. This was repeated 3 times, for 3 randomly chosen recipes, leading to the generation of 3 messages per participant. The recipes were based on the 14 argumentation schemes shown below.
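The template-based natural language generation step can be illustrated with a minimal sketch. The template wording, slot names, and answers below are hypothetical examples, not ArguMessage's actual templates or data; they mimic an 'Argument from expert opinion'-style recipe.

```python
# Hypothetical sketch of template-based message generation; the template
# text and slot names are illustrative, not ArguMessage's actual templates.
def generate_message(template: str, answers: dict) -> str:
    """Fill a scheme template with the participant's answers."""
    return template.format(**answers)

template = ("{expert}, who is an expert on {topic}, says that {claim}. "
            "Therefore, you should {action}.")
answers = {
    "expert": "Dr. Smith",
    "topic": "nutrition",
    "claim": "eating five portions of fruit and vegetables a day is healthy",
    "action": "eat five portions of fruit and vegetables a day",
}
message = generate_message(template, answers)
```

Each recipe pairs one such template with the questions whose answers fill its slots, which is why answers in the wrong format (e.g. a full sentence where a phrase completion was requested) produce ungrammatical messages.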

Results

Participants’ Satisfaction Rating. To determine satisfaction, we calculated the mean rating of all messages generated under each argumentation scheme. The highest-rated scheme was ‘Argument from expert opinion with goal’ with a mean of 4.15, and the lowest rated was ‘Argument from values with goal’ with a mean of 2.23 (see table below). All 216 messages were used for this analysis. Satisfaction with the generated messages was rated significantly above the midpoint of the scale for 8 argumentation schemes (see table below) and at the midpoint of the scale for 4 schemes. However, satisfaction was below the midpoint of the scale for ‘Argument from memory with goal’ and ‘Argument from values with goal’. Overall, users were satisfied with the messages.
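The comparison of a scheme's mean rating to the scale midpoint corresponds to a one-sample z-test. A minimal sketch, using made-up ratings rather than the study's data:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def z_test_vs_midpoint(ratings, midpoint=3.0):
    """One-sample z-test: is the mean rating above the scale midpoint?
    Returns (z statistic, one-sided p-value for mean > midpoint)."""
    n = len(ratings)
    z = (mean(ratings) - midpoint) / (stdev(ratings) / sqrt(n))
    p = 1 - NormalDist().cdf(z)
    return z, p

# Illustrative 5-point Likert ratings for one scheme (not the study's data)
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 3, 4, 5, 4, 4, 3, 5, 4]
z, p = z_test_vs_midpoint(ratings)
```

For schemes not rated significantly above 3, the same test can be rerun with `midpoint=2` to check whether satisfaction at least exceeds the 'dissatisfied' level, as the table caption describes.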

Mean User Satisfaction Rating of Generated Messages within Argumentation Schemes and p-values for Z-test comparing the mean to 3, and for those not-significantly above 3, to 2

Unexpected User Interactions. Out of 216 messages obtained, we rejected 113 (see table below) and approved 61. In addition, there were 42 messages that had minor grammatical (10 messages), spelling (3), typing (1), punctuation (16), and multiple (12) mistakes which could be considered for approval.

As shown above, there were three main reasons for rejection. First, some participants produced messages that were clearly not about healthy eating but, for example, about physical exercise (noted in the table as ‘Different domain’). Second, there were messages where participants had not provided information in the format requested; for example, in Figure 1 the participant is asked to complete the phrase ‘the goal of the user is to’, and a participant may have written a full message instead of completing the phrase (noted in the table as ‘Not followed instructions’). Third, there were messages that were identical to the sample messages provided with the scheme (noted in the table as ‘Copied’ if they otherwise followed instructions, and ‘Copied and not followed instructions’ if, for example, they copied parts of the sample message as answers to the wrong question).

The table below shows the distribution of the number of messages produced with the 14 argumentation schemes used in the system. The ‘total approved’ is calculated by combining the ‘approved’ and ‘considered to be approved’ messages. The table does not include all rejected messages: most were copied or from a different domain, and so reflect difficulty with the instructions for the system as a whole rather than with a particular argumentation scheme. However, the number of cases where instructions were not followed may point towards difficulty with a particular scheme. Overall, the proportion of messages for which people managed to follow the instructions of the argumentation schemes was 84% (86% when excluding copied messages). The proportion was lowest for ‘Argument from memory with goal’, at 76%. So, although the system was quite easy to use, the experimental setup was not clear enough, with some participants copying the example message or producing messages that were not about healthy eating.

Mitigation of Unexpected User Interactions. The system was modified to pre-process user input so as to catch most of the unexpected interactions. Functions were added to remove or avoid most language mistakes, for example, converting capital letters to lower case, removing superfluous full stops, and converting 2nd- and 3rd-person usage to 1st person. Additionally, a training module was incorporated so that participants could practice and get an idea of how the system works before proceeding to the actual study; they could try it multiple times. The instruction not to copy the example message was emphasized. Before running the email security study, we also removed the two lowest-rated argumentation schemes, i.e., ‘Argument from memory with goal’ and ‘Argument from values with goal’, and the three argumentation schemes that involved liking (i.e., ‘Argument from position to know with goal and liking’, ‘Practical reasoning with goal and liking’ and ‘Practical reasoning with liking’). The latter was done partly because ‘liking’ is harder to conceptualize in the email security domain and partly because previous studies suggested that messages based on liking were rated lowest on perceived persuasiveness.
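The kinds of pre-processing described above can be sketched as follows; since ArguMessage's actual rules are not given here, the specific patterns and the person-conversion word list are assumptions made for illustration.

```python
import re

# Naive 2nd/3rd -> 1st person mapping; an assumed, illustrative word list,
# not ArguMessage's actual rule set.
PERSON_MAP = {"you": "I", "your": "my", "yourself": "myself",
              "they": "I", "their": "my"}

def clean_answer(text: str) -> str:
    """Normalise a participant's slot answer before template filling."""
    text = text.strip().lower()        # convert capital letters to lower case
    text = re.sub(r"\.+$", "", text)   # remove trailing full stop(s)
    # convert 2nd/3rd person usage to 1st person, on whole words only
    return re.sub(r"\b(you|your|yourself|they|their)\b",
                  lambda m: PERSON_MAP[m.group(1)], text)

cleaned = clean_answer("You should watch Your diet.")
```

Real post-processing would need more care (e.g. verb agreement after pronoun substitution), but even simple rules like these remove many of the capitalisation and punctuation mistakes observed in Study 1.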

Study 2: We used ArguMessage to generate corpora of email security messages.

Participants. The study was conducted with participants who had some knowledge of or experience with anti-phishing. The link to the study was shared on mailing lists and with known contacts. The invitation to take part (without the link) was also shared on social media, which helped to find domain-knowledgeable participants. The study had 40 participants, of whom 23 were male (2 aged 18-25, 14 aged 26-40, 5 aged 41-65, and 2 aged over 65), 15 were female (1 aged 18-25, 10 aged 26-40, and 4 aged 41-65), and 2 undisclosed. In total, 106 messages were generated.

Procedure. The procedure was the same as in Study 1: participants generated three persuasive messages using three randomly chosen “recipes” (argumentation schemes), answered the questions ArguMessage needed to generate each message via template-based natural language generation, and then rated their satisfaction with each generated message on a 5-point Likert scale and provided feedback. The recipes were based on 9 of the 14 argumentation schemes shown below (5 schemes having been removed after the first study, as explained above).

After the first study, the system was improved (see the mitigations described above). An illustration of the completed participant input is shown in the figure above. In this instance, the message generated would be “If you stop trying to check for genuine links in incoming emails now, all your previous efforts will be wasted. Therefore, you ought to continue trying to do that”.

Results

Participants’ Satisfaction Rating. To determine satisfaction, we calculated the mean rating of all messages generated under each argumentation scheme. The highest-rated schemes were ‘Argument from position to know with goal’ and ‘Argument from rules with goal’ with a mean of 3.80, and the lowest rated was ‘Argument from sunk cost with action’ with a mean of 2.79 (see table below). All 106 messages were used for this analysis. Satisfaction ratings for the messages produced by the different schemes differ between the two studies and seem somewhat lower in this study; this is likely an effect of the domain. Still, satisfaction with the generated messages was rated significantly above the midpoint of the scale for 3 argumentation schemes (see table below) and at the midpoint of the scale for 6 schemes.

Mean User Satisfaction Rating of Generated Messages within Argumentation Schemes and p-values for Z-test comparing the mean to 3, and for those not-significantly above 3, to 2

Unexpected User Interactions. Out of the 106 messages obtained, we rejected 47 (see table below) and approved 46. In addition, there were 12 messages with minor grammar (9 messages) and spelling (3) mistakes that could be considered for approval. These mistakes could be fixed by adding further post-processing to the system.

The table below shows the distribution of the number of messages produced with the 9 argumentation schemes used in the system. As before, the ‘total approved’ is calculated by combining the ‘approved’ and ‘considered to be approved’ messages. Overall, the proportion of messages for which people managed to follow the instructions of the system was 90%, which suggests the system was quite easy to use. The changes we made after the first study had a positive effect on ease of use. Nevertheless, some participants still copied the example message or produced messages that were not about email security.

Reflection

We investigated the effectiveness of ArguMessage in two domains: healthy eating and email security. Whilst the studies used laypeople, the intention is ultimately for the system to be used by domain experts, to guarantee that the messages produced have domain validity. We ran the studies with laypeople to check that the system is easy enough to use and produces messages that are natural enough to satisfy users. Laypeople were used because domain experts are hard to recruit and would spend considerable time verifying the correctness of the content of the messages (for example, a dietitian may need substantial time to ensure dietary advice is accurate), which would make studies with experts very time-consuming. The studies in this paper ensure that the usability of the system will be good enough for experts to use: if even laypeople can produce messages that adhere to an argumentation scheme, then so can domain experts.

There were some clear issues when our participants used the system. First, a substantial amount of copying from the sample messages took place, which shows that some participants were not clear about what was expected of them. After we added some training and made the instruction not to copy more explicit (by bolding the words) in Study 2, the rate of copying fell from 29% to 25%, which is still substantial. This indicates that a longer, more detailed training session is needed (before deploying the system, we could, for example, add a video tutorial). Second, some participants produced messages that were outside the domain, an issue that would not occur with domain experts. Based on the results, we modified the system slightly between the studies to add some post-processing, and based on the second study we plan to add more. Overall, the effectiveness of generating messages was good when considering those participants who produced original messages applicable to the domain; there were only a limited number of cases where the instructions of the scheme were not followed, and no scheme was particularly bad in this respect. Participants were also generally satisfied with the messages produced.