Explanations
phrases related to forceful actions or interventions
references to coercive actions and violence
Negative Logits
Prediction    -0.71
paragraph     -0.70
ancial        -0.69
purpose       -0.69
daily         -0.68
ership        -0.68
orno          -0.67
sal           -0.67
Purpose       -0.67
rug           -0.64
Positive Logits
forcefully    1.13
forcibly      1.07
dru           0.83
steril        0.82
kissed        0.82
shoved        0.82
awoken        0.79
overpowered   0.79
violently     0.78
avage         0.77
Activations Density: 0.011%