INDEX
Explanations
references to violence and its various forms and implications
Negative Logits
Token          Logit
horn           -0.17
rying          -0.16
acter          -0.16
ÑĢÑıдÑĥ        -0.16
timeofday      -0.15
åij³            -0.15
edException    -0.15
alian          -0.14
owitz          -0.14
si             -0.14
Positive Logits
Token          Logit
ERTICAL        0.15
ocities        0.14
OLUME          0.14
ł              0.14
argo           0.13
vens           0.13
_manual        0.13
endo           0.13
/or            0.13
ergy           0.13
Activations Density: 0.028%