Explanations
mentions of "killing" or related actions
Negative Logits
Token    Logit
iland    -0.21
eck      -0.17
iyel     -0.15
sez      -0.15
evice    -0.15
onto     -0.14
asley    -0.14
aland    -0.14
ulled    -0.14
leh      -0.14
Positive Logits
Token        Logit
off          0.25
joy          0.23
outright     0.23
spree        0.23
deer         0.22
innocent     0.21
-off         0.21
indiscrim    0.21
çİ°åľº       0.20
switch       0.19
Activations Density 0.057%
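
The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero; the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and used only for illustration:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means (number of active positions) / (total positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, with the feature firing on 6.
toy_activations = [0.0] * 10_000
for position in (12, 345, 999, 4_000, 7_777, 9_998):
    toy_activations[position] = 1.5

density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.057% would mean the feature is active on roughly 57 of every 100,000 sampled tokens.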