INDEX
Explanations
comparisons between different situations or entities
comparative phrases that highlight preferences or contrasts
Negative Logits
estern    -0.86
illary    -0.75
ero       -0.72
umbn      -0.70
roo       -0.68
ahime     -0.68
OSP       -0.67
ango      -0.67
eri       -0.67
uron      -0.66
Positive Logits
anything       1.26
any            1.05
ever           0.96
anybody        0.94
anyone         0.94
vice           0.81
necessarily    0.80
usual          0.75
actual         0.75
anywhere       0.74
Activations Density: 0.102%