Explanations
instances of violence or gun-related actions
Negative Logits
Token        Logit
LOAT         -0.16
endir        -0.16
cru          -0.15
Pipes        -0.15
agg          -0.15
ãģ£ãģı       -0.15
UnderTest    -0.14
igated       -0.14
533          -0.14
uctor        -0.14
Positive Logits
Token        Logit
aub          0.17
obus         0.15
meli         0.14
AZY          0.14
salv         0.14
nist         0.13
BG           0.13
MG           0.13
utters       0.13
az           0.13
Activations Density: 0.068%