Explanations
instances of the word "kill" and its variations related to violence
Negative Logits
Token         Logit
iland         -0.18
onto          -0.15
ulled         -0.15
/out          -0.14
iß            -0.14
/Framework    -0.14
Blur          -0.14
elect         -0.14
oha           -0.14
land          -0.14
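Some entries in these lists are byte-level BPE tokens. The page does not name the model, so this assumes GPT-2's tokenizer. Under that tokenizer, raw exports show each byte through a byte-to-Unicode table. For example, the raw string iÃŁ is the byte sequence 69 C3 9F, which decodes to "iß".

The sketch below reproduces that table from OpenAI's public GPT-2 encoder and inverts it. `decode_token` is a hypothetical helper name, not part of any dashboard API. Tokens that hold only part of a multi-byte character cannot be decoded on their own and come back with a U+FFFD replacement character.

```python
def bytes_to_unicode():
    """GPT-2's byte -> printable-character table (as in the public GPT-2 encoder)."""
    # Printable Latin-1 bytes map to themselves.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte (controls, space, 0x7F-0xA0, 0xAD) is shifted to 256 + n.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable stand-in character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Hypothetical helper: turn a byte-level BPE token string back into text."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    # Partial multi-byte characters become U+FFFD instead of raising.
    return raw.decode("utf-8", errors="replace")


print(decode_token("iÃŁ"))      # prints iß
print(decode_token("çİ°åľº"))   # prints 现场
```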
Positive Logits
Token         Logit
off            0.23
joy            0.20
spree          0.20
/disable       0.20
switch         0.19
æĪ             0.19   (incomplete UTF-8 byte sequence)
lier           0.18
deer           0.18
现场           0.17   (Chinese: "scene, on-site")
ibri           0.17
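Dashboards of this kind usually derive the negative and positive logit lists from the feature's direct effect on the output vocabulary. The usual recipe projects the feature's decoder direction through the model's unembedding matrix, then reads off the lowest and highest scoring tokens.

The sketch below assumes that recipe. It also assumes the hypothetical names `decoder_dir` (the feature's decoder row) and `W_U` (the unembedding matrix). It ignores any final LayerNorm scaling or normalization a particular dashboard may apply, so exact values would differ.

```python
import numpy as np


def top_logit_effects(decoder_dir, W_U, vocab, k=10):
    """Return the k most negative and k most positive tokens by direct logit effect.

    decoder_dir: (d_model,) decoder direction of one feature (assumed input)
    W_U:         (d_model, vocab_size) unembedding matrix (assumed input)
    vocab:       token strings indexed by token id
    """
    effects = decoder_dir @ W_U          # (vocab_size,) effect on each token's logit
    order = np.argsort(effects)          # token ids sorted from lowest to highest effect
    negative = [(vocab[i], float(effects[i])) for i in order[:k]]
    positive = [(vocab[i], float(effects[i])) for i in order[::-1][:k]]
    return negative, positive


# Toy example with placeholder tokens (not data from this feature).
vocab = ["a", "b", "c", "d"]
W_U = np.array([[0.2, -0.1, 0.05, -0.3],
                [5.0, 5.0, 5.0, 5.0]])
decoder_dir = np.array([1.0, 0.0])
neg, pos = top_logit_effects(decoder_dir, W_U, vocab, k=2)
print(neg)  # [('d', -0.3), ('b', -0.1)]
print(pos)  # [('a', 0.2), ('c', 0.05)]
```

Reading both lists from a single sorted array guarantees they are consistent with each other. With a real 50k-token vocabulary, `np.argpartition` would avoid a full sort.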
Activation density: 0.047%
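Activation density is commonly reported as the percentage of sampled tokens on which the feature fires. A density of 0.047% corresponds to roughly 47 firing tokens per 100,000.

The sketch below assumes "fires" means an activation strictly greater than zero. The sample size and activation values are illustrative placeholders, not this feature's data.

```python
import numpy as np


def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose feature activation exceeds `threshold` (assumed definition)."""
    acts = np.asarray(activations)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size


# Illustrative sample: 47 firing tokens out of 100,000.
acts = np.zeros(100_000)
acts[:47] = 1.3
print(f"{activation_density(acts):.3f}%")  # prints 0.047%
```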