Explanations
phrases related to physical violence and injury
Negative Logits
ponential     -0.16
noch          -0.16
Consortium    -0.16
atte          -0.15
665           -0.15
arness        -0.14
neau          -0.14
uum           -0.14
polar         -0.13
colore        -0.13
Positive Logits
CHASE         0.16
unprotected   0.15
ós            0.15
tons          0.15
hay           0.15
-sensitive    0.15
sensitive     0.14
head          0.14
hay           0.14
ungan         0.14
Activations Density 0.028%
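
A density figure like the one above is usually read as the share of token positions on which the feature fires at all. The sketch below illustrates that reading under stated assumptions: the function name `activation_density`, the zero threshold, and the sample data are all hypothetical and are not taken from the tool that produced this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means (active positions) / (all positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 3 of which fire.
acts = [0.0] * 10_000
for i in (12, 480, 9_001):
    acts[i] = 1.7

density = activation_density(acts)
print(f"{density:.3%}")  # prints 0.030%
```

On this reading, a density of 0.028% would mean the feature is active on roughly 28 out of every 100,000 tokens.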