INDEX
Explanations
instances of the word "We" and its variants, indicating collective statements or intentions
Negative Logits
Token    Logit
“        -0.20
._↵      -0.18
„        -0.17
-↵       -0.16
.*↵      -0.16
--↵      -0.16
--↵      -0.15
-↵       -0.15
'↵       -0.15
—↵       -0.15
Positive Logits
Token    Logit
ir       0.43
apons    0.41
bsite    0.41
ng       0.38
gether   0.37
ek       0.37
ather    0.36
ory      0.35
ide      0.32
thing    0.32
Activations Density: 0.110%