Explanations
phrases containing words that express positivity or admiration
phrases that express preference for something being the best or better option
Negative Logits
naires      -0.74
stairs      -0.70
adra        -0.69
sembly      -0.64
ortment     -0.64
giene       -0.64
area        -0.63
hiro        -0.60
wash        -0.60
aine        -0.59
Positive Logits
than          0.93
testament     0.81
encaps        0.80
Than          0.76
succinct      0.74
nor           0.72
deserving     0.70
juxtap        0.69
exempl        0.68
illustration  0.67
Activations Density: 0.114%