INDEX
Explanations
comparisons or choices between different entities
comparisons or contrasts between two entities or ideas
Negative Logits
shire     -0.79
lied      -0.77
overed    -0.72
olog      -0.71
iola      -0.71
estone    -0.71
ERN       -0.70
ortal     -0.70
ogen      -0.70
YD        -0.69
Positive Logits
hill       0.68
theirs     0.65
pecting    0.62
creen      0.60
bandits    0.60
USPS       0.59
expecting  0.59
nil        0.58
await      0.58
hers       0.57
Activations Density: 0.020%