Interactive storytelling systems can produce many variants of a story. A major challenge in designing such systems is evaluating the resulting narratives.
Ideally, every variant of the resulting story would be seen and evaluated, but because of the combinatorial explosion of the story space, this is infeasible in all but the smallest domains. Nevertheless, the system designer still needs control over the generated stories, and the designer's input cannot be replaced by a computer.
In this paper, we propose a general methodology for semi-automatic evaluation of narrative systems based on tension curve extraction and clustering of similar stories. Our preliminary results indicate that a straightforward approach works well in simple scenarios, but that complex story spaces require further improvements.
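
To make the idea of clustering stories by tension curve concrete, the following is a minimal sketch and not the method used in this paper. It assumes that each story has already been reduced to a list of per-event tension scores (how those scores are extracted is not specified here), resamples every curve to a common length by linear interpolation, and groups curves with a simple greedy "leader" clustering under Euclidean distance. The function names, the threshold, and the sample curves are all hypothetical.

```python
from math import sqrt


def resample(curve, n=10):
    """Linearly interpolate a tension curve to n evenly spaced points (n >= 2)."""
    if len(curve) == 1:
        return [float(curve[0])] * n
    out = []
    for i in range(n):
        pos = i * (len(curve) - 1) / (n - 1)  # fractional index into the curve
        lo = int(pos)
        hi = min(lo + 1, len(curve) - 1)
        frac = pos - lo
        out.append(curve[lo] * (1 - frac) + curve[hi] * frac)
    return out


def distance(a, b):
    """Euclidean distance between two curves of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster(curves, threshold=1.0, n=10):
    """Greedy leader clustering (an assumed stand-in for any clustering method).

    Each curve joins the first cluster whose representative lies within
    `threshold`; otherwise it starts a new cluster and becomes its representative.
    Returns a list of clusters, each a list of story indices.
    """
    clusters = []  # pairs of (representative curve, member indices)
    for idx, curve in enumerate(curves):
        resampled = resample(curve, n)
        for representative, members in clusters:
            if distance(representative, resampled) <= threshold:
                members.append(idx)
                break
        else:
            clusters.append((resampled, [idx]))
    return [members for _, members in clusters]


# Hypothetical tension curves for three generated stories.
stories = [
    [0, 1, 2, 3, 1],     # tension rises, then drops at the end
    [0, 1, 2, 3, 3, 1],  # same shape with one extra event
    [3, 2, 1, 0],        # tension falls throughout
]
groups = cluster(stories, threshold=1.5)
print(groups)
```

Under these assumptions the designer would inspect one representative story per cluster instead of every variant; the choice of distance measure and threshold determines how coarse that summary is.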