Explanations
comparative relationships between two concepts, where one concept is usually favorable or advantageous over the other
comparative phrases that quantify improvement or decline
Negative Logits
aws      -0.73
sr       -0.73
ittees   -0.71
tags     -0.71
nar      -0.68
brew     -0.67
ptives   -0.66
hack     -0.65
kus      -0.65
stan     -0.63
Positive Logits
uliffe       0.72
likely       0.72
Farage       0.68
actu         0.67
healthier    0.66
likelihood   0.66
incentive    0.66
payoff       0.65
flux         0.63
cyclopedia   0.63
Activations Density 0.053%