Explanations
phrases expressing preferences for specific things or entities
mentions of "favorite" things or preferences
Negative Logits
Token       Logit
heed        -0.83
aping       -0.82
hani        -0.78
acial       -0.78
urches      -0.76
asse        -0.73
ural        -0.73
thur        -0.73
aton        -0.73
pex         -0.72
Positive Logits
Token       Logit
favorite    1.25
favorites   1.03
favorite    0.99
Favorite    0.97
Favorite    0.97
favourite   0.96
é¾įå¥ij士    0.87
darling     0.84
="#         0.82
ļéĨĴ        0.82
Activations Density: 0.011%