Classification techniques have been used in software-engineering research to perform tasks such as categorizing software executions. Existing work has traditionally proposed single-label failure classification techniques, in which the training executions and subsequent executions are each labeled with a single fault attribution. Although such approaches have received substantial attention in research on automated software engineering, recent work shows that the assumption of a single attribution is often unrealistic: in practice, inherent characteristics of software behavior, such as multiple faults that contribute to a failure and interactions among faults, may reduce the effectiveness of these techniques. To relax this assumption, researchers in the machine learning field have proposed new approaches for multi-label classification. However, the effectiveness and efficiency of such approaches vary widely across application domains. In this paper, we empirically investigate the performance of these approaches on the failure classification task under different application settings. We conducted experiments using eight classification techniques on five subject programs with more than 8,000 faulty versions to investigate how each technique accounts for the intricacies of software behavior. Our experimental results show that multi-label techniques provide improved accuracy over single-label techniques. We also evaluated the efficiency of the training and prediction phases of each technique, and we offer guidance on the applicability of each technique to different usage contexts.