INDEX
Explanations
references to violence or attacks
Negative Logits
lass      -0.16
ляв       -0.16
legg      -0.15
entr      -0.15
Islam     -0.14
Dans      -0.14
Dans      -0.14
RECEIVER  -0.14
_mgr      -0.14
leston    -0.13
Positive Logits
tps       0.16
rencont   0.15
ground    0.15
oteca     0.15
orts      0.15
ż         0.15
Bi        0.14
éľ        0.14
assen     0.14
θο        0.14
Activations Density 0.027%