INDEX
Explanations
comparisons asking to choose between options
references to choices or options in various contexts
Negative Logits
ļéĨĴ         -0.87
enture       -0.78
limited      -0.72
²¾           -0.69
zek          -0.68
isse         -0.67
GGGGGGGG     -0.65
paren        -0.64
livious      -0.64
©¶æ¥µ        -0.62
Positive Logits
suits        1.10
best         1.04
dominates    1.01
corresponds  1.00
wins         0.95
fits         0.95
BEST         0.92
deserves     0.89
tops         0.87
inspires     0.85
Activations Density: 0.176%