INDEX
Explanations
phrases contrasting differences
Negative Logits
Token      Logit
ATA        -0.88
vez        -0.82
rollers    -0.77
mberg      -0.72
rive       -0.71
WI         -0.62
kamp       -0.62
ale        -0.60
ITED       -0.60
staples    -0.59
Positive Logits
Token      Logit
between    1.24
between    1.12
Between    1.07
iveness    1.02
iator      0.98
iating     0.96
ials       0.92
erence     0.91
yip        0.84
maker      0.84
Activations Density: 0.040%