Quantitative Evaluation:
Our Surface Realization module achieved a sentence-level BLEU-4 score
of 74.67% on the test set.
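For concreteness, the following is a minimal sketch of how a sentence-level BLEU-4 score can be computed using NLTK. The instruction strings and the choice of smoothing method are illustrative assumptions; the paper does not specify its exact BLEU configuration.

```python
# Minimal sketch: sentence-level BLEU-4 with NLTK.
# The example instructions and smoothing method are hypothetical;
# they are not taken from the SAIL corpus or the paper.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "turn left at the sofa and go to the end of the hall".split()
hypothesis = "turn left at the sofa then walk to the end of the hall".split()

# weights=(0.25,)*4 weighs 1- through 4-grams equally (BLEU-4);
# smoothing avoids zero scores when a higher-order n-gram is absent.
score = sentence_bleu(
    [reference], hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"sentence-level BLEU-4: {score:.4f}")
```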
Qualitative Evaluation:
We asked 42 participants on Amazon Mechanical Turk to
navigate a three-dimensional virtual environment according
to a provided route instruction. The route instructions were
randomly sampled from those generated using our method and those
provided by humans as part of the SAIL corpus.
No participant experienced the same scenario with both human-annotated
and machine-generated instructions.
We evaluated the accuracy with which participants followed
the natural language instructions in terms of the Manhattan distance
between the desired destination and the participant's location when
they finished the scenario. Results are shown in Figure 1.
Figure 1: Participants' distances from the goal.
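As a concrete illustration of the error metric described above, here is a small sketch assuming the environment is discretized into grid cells with integer (x, y) coordinates; the SAIL environments are grid-based, so this is a reasonable assumption, but the coordinates below are hypothetical.

```python
# Sketch of the evaluation metric: Manhattan distance between the
# desired destination and the participant's final position.
# Grid coordinates are hypothetical placeholders.
def manhattan_distance(goal, final):
    """L1 distance between two (x, y) grid positions."""
    return abs(goal[0] - final[0]) + abs(goal[1] - final[1])

goal_position = (4, 7)    # desired destination (hypothetical)
final_position = (6, 5)   # where the participant stopped (hypothetical)
print(manhattan_distance(goal_position, final_position))  # -> 4
```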
The participants were presented with a survey consisting of eight questions,
three requesting demographic information and five requesting
feedback on their experience and the quality of the instructions that they
followed (Figures 2 to 6).
Figure 2: How would you evaluate the task in terms of difficulty?
Figure 3: How many times did you have to backtrack?
Figure 4: Who do you think generated the instructions?
Figure 5: How would you define the amount of information provided by the instructions?
Figure 6: How confident are you that you followed the desired path?