INDEX
Explanations
phrases related to moral and ethical judgments
Negative Logits
lero        -0.07
rip         -0.07
Ỽt          -0.07
_preds      -0.06
šti         -0.06
наÑĩе       -0.06
гл          -0.06
instead     -0.06
ضÙħÙĨ       -0.06
pena        -0.06
Positive Logits
physical    0.15
physical    0.14
Physical    0.13
overt       0.13
direct      0.13
directly    0.12
Physical    0.12
direct      0.11
obvious     0.11
physically  0.11
Activations Density: 0.058%
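
For readers unfamiliar with the density figure, here is a minimal sketch of how such a value is commonly computed, assuming it means the fraction of sampled token positions on which the feature activates above zero. The dashboard does not state its exact definition, so the function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all. This is an illustrative definition, not one
    confirmed by the dashboard.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 token positions, 6 of which are non-zero.
sample = [0.0] * 9994 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1]
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.058% would correspond to roughly 6 activating tokens per 10,000 sampled.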