To cluster executions that exhibit faulty behavior by the faults that cause them, researchers have proposed using internal execution events, such as statement profiles, to (1) measure execution similarities, (2) categorize executions based on those similarity results, and (3) suggest the resulting categories as sets of executions exhibiting uniform fault behavior. However, due to a paucity of evidence correlating profiles and output behavior, researchers employ multiple simplifying assumptions in order to justify such approaches. In this paper we present an empirical study of profile correlation with output behavior, and we reexamine the suitability of such simplifying assumptions. We examine over 4 billion test-case outputs and execution profiles from multiple programs with over 9000 versions. Our data provides evidence that with current techniques many executions should be omitted from the clustering analysis to provide clusters that each represent a single fault. In addition, our data reveals the previously undocumented effects of multiple faults on failures, which has implications for techniques’ ability (and inability) to properly cluster. Our results suggest directions for the improvement of future failure-clustering techniques that better account for software-fault behavior.