Deriving Explainable Discriminative Attributes Using Confusion About Counterfactual Class
Nakyeong Yang, Taegwan Kang, Kyomin Jung
SPS
Length: 00:05:32
Recently, Integrated Gradients (IG)-based methods have been widely used to explain the decision processes of deep neural networks (DNNs). However, they consider only information about the predicted class while neglecting information about the remaining classes. In this paper, we propose a novel counterfactual explanation method, Discriminative Gradients (DiscGrad), which derives explainable discriminative attributes by considering not only the predicted class but also the counterfactual classes. Specifically, we compute discriminative attributes by removing the attributions of the counterfactual classes, which makes it possible to derive only the key attributes that distinguish the predicted decision from the alternatives. We also weight the discriminative attributes by the model's degree of confusion about each counterfactual class. We evaluate our method by measuring how much the logit decreases when important attributes are perturbed. Experimental results on widely used image and text datasets show that our proposed method outperforms the strong IG baseline. In addition, we examine the relationship between class correlation and the performance of the discriminative attributes to demonstrate the effectiveness of our method.
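The idea described above — predicted-class IG minus confusion-weighted counterfactual-class IG — can be sketched as follows. This is not the authors' implementation; it is a minimal illustrative approximation assuming the "confusion" weight is the softmax probability of each counterfactual class, and the names (`discriminative_attributes`, `grad_fn`, `logit_fn`) are hypothetical. A toy linear model stands in for the DNN so the gradients are exact.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def integrated_gradients(grad_fn, x, baseline, target, steps=50):
    # Approximate IG: average the target-class gradient along the
    # straight-line path from the baseline to the input, then scale
    # by the input-baseline difference.
    alphas = np.linspace(0.0, 1.0, steps)
    grads = np.mean(
        [grad_fn(baseline + a * (x - baseline), target) for a in alphas],
        axis=0,
    )
    return (x - baseline) * grads

def discriminative_attributes(grad_fn, logit_fn, x, baseline, steps=50):
    # Sketch of a DiscGrad-style attribution (our reading of the abstract):
    # start from the predicted class's IG, then subtract each counterfactual
    # class's IG weighted by how confused the model is about that class
    # (here approximated by its softmax probability).
    probs = softmax(logit_fn(x))
    pred = int(np.argmax(probs))
    attr = integrated_gradients(grad_fn, x, baseline, pred, steps)
    for c in range(len(probs)):
        if c == pred:
            continue
        attr -= probs[c] * integrated_gradients(grad_fn, x, baseline, c, steps)
    return attr

# Toy linear model: logits = W @ x, so the gradient of logit c w.r.t. x is W[c].
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
logit_fn = lambda x: W @ x
grad_fn = lambda x, c: W[c]

x = np.array([2.0, 1.0, 0.5])
baseline = np.zeros_like(x)
attrs = discriminative_attributes(grad_fn, logit_fn, x, baseline)
print(attrs)
```

For this linear model IG reduces to gradient times input, so the first and third attributes come purely from the predicted class, while the second is pushed negative because it supports only the counterfactual class — mirroring the paper's goal of keeping attributes that contrast with other decisions.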