INDEX
Explanations
instructions or options for taking action
phrases indicating options or alternatives
Negative Logits
ocracy         -0.79
Therefore      -0.75
correctness    -0.73
hed            -0.69
eness          -0.69
Merit          -0.67
Thus           -0.59
emen           -0.58
ilty           -0.57
matters        -0.57
Positive Logits
alternatively  1.35
chard          1.09
lando          1.08
acles          1.05
Else           1.04
acle           1.01
browse         0.99
else           0.99
chid           0.92
GAN            0.92
Activations Density 0.095%