Explanations
mentions of violence and its various contexts or impacts
Negative Logits
opa       -0.17
lify      -0.16
oga       -0.16
akin      -0.15
ublish    -0.15
иш        -0.15
_printf   -0.14
овый      -0.14
cheid     -0.14
ampoo     -0.14
Positive Logits
directed  0.22
toward    0.21
towards   0.21
Against   0.21
against   0.21
Tow       0.20
/ag       0.18
Towards   0.18
ive       0.17
Against   0.15
Activations Density: 0.031%