Explanations
instances of words describing violence or brutality
Negative Logits
retty     -0.17
onya      -0.16
_GUID     -0.14
Gardens   -0.14
IMIT      -0.14
ihat      -0.14
itel      -0.14
iye       -0.14
cip       -0.14
Dess      -0.14
Positive Logits
anc         0.15
lemek       0.14
evil        0.14
ÄĻk         0.14
fleet       0.14
agon        0.14
APE         0.14
storm       0.14
prim        0.14
Allocator   0.14
Activations Density 0.001%
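
The page does not say how the density figure is defined. A common reading is the percentage of sampled token positions on which the feature's activation is nonzero. The sketch below assumes that definition. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and not taken from the page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of positions whose activation exceeds `threshold`.

    Assumed definition: the source page does not state how density is computed.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the feature fires on 1 of 100,000 sampled tokens.
sample = [0.0] * 99_999 + [3.2]
print(f"{activation_density(sample):.3f}%")  # prints "0.001%"
```

Under this reading, a density of 0.001% would mean the feature fires on roughly one token in every hundred thousand sampled.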