Explanations
statements related to reasoning and justification
Negative Logits
aint    -0.16
485     -0.14
Trab    -0.14
ży      -0.13
roy     -0.13
tpl     -0.13
484     -0.13
/cat    -0.13
ola     -0.13
365     -0.13
Positive Logits
why       0.28
reasons   0.27
Reasons   0.23
why       0.23
âĹĦ       0.22
reason    0.21
reason    0.20
ìķ½       0.19
.reason   0.19
Reason    0.18
Activations Density: 0.207%