Explanations
questions starting with "Would" and presenting hypothetical scenarios
questions posed to the reader
Negative Logits
り                   -0.67
natureconservancy    -0.61
SPONSORED            -0.60
が                   -0.59
minus                -0.58
displayText          -0.58
。                   -0.58
に                   -0.57
hig                  -0.57
traced               -0.57
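
Dashboards of this kind often show non-ASCII tokens in the tokenizer's raw byte-level form, such as `ãĤĬ`, rather than as readable characters. The sketch below shows one way to recover the readable text, assuming the underlying model uses a GPT-2-style byte-level BPE vocabulary; the source does not name the tokenizer, so that is an assumption. `decode_token` is a hypothetical helper name, not part of any library.

```python
def bytes_to_unicode():
    """Byte -> printable-character table in the style of GPT-2 byte-level BPE.

    Printable Latin-1 bytes map to themselves; every other byte is
    shifted to a code point starting at 256.
    """
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table


# Inverse table: displayed character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(raw: str) -> str:
    """Turn a raw vocabulary string back into readable UTF-8 text.

    Assumes every character of `raw` comes from the byte-level table;
    byte sequences that are not valid UTF-8 are replaced, not raised.
    """
    data = bytes(CHAR_TO_BYTE[ch] for ch in raw)
    return data.decode("utf-8", errors="replace")


# Raw forms of the four non-ASCII tokens in the negative-logit list.
for raw in ["ãĤĬ", "ãģĮ", "ãĢĤ", "ãģ«"]:
    print(raw, "->", decode_token(raw))
```

Under that assumption the four strings decode to the Japanese characters り, が, 。 and に. Plain ASCII tokens such as `minus` pass through unchanged.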
Positive Logits
anyone               1.04
n                    1.03
anybody              1.02
you                  0.94
it                   0.86
they                 0.83
somebody             0.82
we                   0.81
someone              0.81
YOU                  0.74
Activations Density 0.074%