Explanations
instances where one option or action is preferred over another
the repeated use of the word "instead."
Negative Logits
cision    -0.72
ongo      -0.71
mud       -0.67
fried     -0.65
lees      -0.64
raz       -0.64
anon      -0.64
minent    -0.64
Shake     -0.63
rament    -0.63
Positive Logits
opting      0.88
instead     0.84
instead     0.80
preferring  0.70
chose       0.70
opt         0.68
cannabin    0.68
opted       0.67
passively   0.67
artments    0.67
Activations Density 0.020%
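
The density figure above can be read as the share of token positions on which the feature fires. That reading is an assumption, since the page does not define the metric. Under it, a minimal sketch of the calculation looks like the following; the function name `activation_density`, the `threshold` parameter, and the toy data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature
    is active; the dashboard itself does not spell out the definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: the feature fires on 2 of 10,000 token positions.
toy_activations = [0.0] * 9998 + [1.3, 0.4]
density = activation_density(toy_activations)
print(f"{density:.3%}")  # prints 0.020%
```

At a density this low, the feature is active on roughly 1 in 5,000 tokens, which fits a feature tied to a narrow pattern such as "instead" or "opted".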