INDEX
Explanations
phrases indicating similarity or comparability
phrases emphasizing comparison and similarity
Negative Logits
utherford    -0.79
orer         -0.77
orsi         -0.70
omore        -0.67
onite        -0.65
wat          -0.65
irens        -0.64
illard       -0.63
erella       -0.63
izons        -0.62
Positive Logits
manner       1.71
fashion      1.42
ways         1.38
way          1.30
vein         1.19
contexts     1.11
context      1.08
sense        1.05
terms        1.05
guise        1.04
Activations Density: 0.270%