INDEX
Explanations
instances of physical conflict or violence
Negative Logits
Token       Logit
inki        -0.18
pornos      -0.15
ìłĿ         -0.15
æŀª         -0.15
incy        -0.15
ÑĩеÑĢв      -0.15
unfavor     -0.14
iddi        -0.14
ãİ          -0.14
λοÏį        -0.14
Positive Logits
Token       Logit
fist        0.30
fists       0.27
punches     0.27
punch       0.26
physical    0.26
boxing      0.25
violence    0.24
punching    0.24
physically  0.24
punched     0.23
Activations Density 0.151%