INDEX
Explanations
instances of violence or aggressive actions
Negative Logits
ught            -0.08
prav            -0.07
.createObject   -0.07
eed             -0.06
ấn              -0.06
fal             -0.06
meis            -0.06
spo             -0.06
ÏĦαι            -0.06
íĻĺ             -0.06
Positive Logits
into            0.09
off             0.08
away            0.08
into            0.07
ano             0.07
anos            0.07
aran            0.06
Geh             0.06
_into           0.06
back            0.06
Activations Density: 0.068%