INDEX
Explanations
phrases related to manipulation or deception
phrases indicating manipulation or coercion
Negative Logits
summarizes    -0.65
clips         -0.65
traced        -0.64
anwhile       -0.63
fleet         -0.63
remarked      -0.63
lang          -0.63
Internet      -0.63
thora         -0.63
Rosenstein    -0.63
Positive Logits
surrender     0.87
submission    0.85
obedience     0.83
acquies       0.72
embrace       0.72
heel          0.71
favour        0.71
believing     0.70
staying       0.70
cooperate     0.69
Activations Density: 0.152%