INDEX
Explanations
verbs related to actions or events
statements about deception or manipulation
Negative Logits
accompanied      -0.75
fter             -0.68
fters            -0.65
critical         -0.65
ggles            -0.63
foreseen         -0.62
Approximately    -0.62
è¦ļéĨĴ           -0.60
uers             -0.60
cone             -0.60
Positive Logits
themselves       1.39
THEIR            0.88
their            0.85
us               0.81
li               0.80
uniforms         0.72
selves           0.71
fools            0.70
selves           0.66
me               0.65
Activations Density: 0.714%