Explanations
phrases indicating a choice or preference
choices or decisions along with associated preferences
Negative Logits
Token           Logit
ankind          -0.76
ipples          -0.63
latent          -0.58
ONY             -0.57
disturbances    -0.56
appre           -0.56
vind            -0.55
pires           -0.55
glimps          -0.55
incidents       -0.55
Positive Logits
Token           Logit
instead         1.46
instead         1.22
Instead         1.12
Instead         1.10
because         1.05
option          1.04
rather          1.03
alternatives    1.00
lest            0.94
anyways         0.92
Activations Density: 0.565%