INDEX
Explanations
phrases indicating emotional distress or critical situations
Negative Logits
.faces          -0.16
doom            -0.15
из              -0.14
arkin           -0.14
opro            -0.14
맨              -0.14
captivity       -0.14
_EMPTY          -0.14
oder            -0.13
catastrophic    -0.13
Positive Logits
Innoc           0.20
innocent        0.20
innoc           0.18
undef           0.18
reput           0.16
Affected        0.15
Attempts        0.15
ammen           0.15
Reputation      0.15
.Undef          0.14
Activations Density: 0.014%