Explanations
words expressing a preference for a particular option over another
instances of the word "Rather" indicating contrast or comparison
Negative Logits
Token          Logit
mberg          -0.65
ammy           -0.63
amba           -0.62
DD             -0.62
MIL            -0.62
[+             -0.61
ORN            -0.60
saf            -0.60
championship   -0.59
Quake          -0.59
Positive Logits
Token          Logit
Rather         0.86
tons           0.84
Rather         0.83
rather         0.77
tif            0.77
ado            0.75
swer           0.75
itably         0.75
Than           0.71
Instead        0.70
Activations Density: 0.004%