INDEX
Explanations
comparisons where one option is explicitly contrasted with another
phrases that present contrasting ideas or alternatives
Negative Logits
enegger   -0.64
FANT      -0.64
Dahl      -0.61
boys      -0.60
redo      -0.59
Kard      -0.59
chairs    -0.57
Semin     -0.57
fly       -0.56
mberg     -0.55
Positive Logits
itably       0.79
necessarily  0.73
thereto      0.73
icip         0.72
arily        0.68
ifice        0.68
agonist      0.66
thodox       0.66
ively        0.66
oldown       0.63
Activations Density: 0.025%