Explanations
references to violent acts or conflicts
Negative Logits
ality      -0.18
esta       -0.17
olu        -0.15
faction    -0.15
ally       -0.15
.au        -0.15
aled       -0.14
owie       -0.14
arity      -0.14
erator     -0.14
Positive Logits
ively      0.19
IVEN       0.16
kre        0.15
/mock      0.15
iveness    0.15
ersh       0.15
ademic     0.15
able       0.14
robe       0.14
erson      0.14
Activations Density 0.046%