INDEX
Explanations
statements related to moral dilemmas and ethical judgments
Negative Logits
itat        -0.14
641         -0.14
orem        -0.13
оно         -0.13
ERCHANT     -0.13
ameleon     -0.13
dubious     -0.13
aus         -0.13
åģ¥         -0.13
ess         -0.13
Positive Logits
easier       0.21
better       0.20
true         0.19
raining      0.19
happening    0.19
incumbent    0.18
supposed     0.18
coincidence  0.18
against      0.17
necessary    0.17
Activations Density: 0.570%