INDEX
Explanations
preferences or popular choices
instances of the word "favorite" and its variations
Negative Logits
Token      Logit
urers      -0.79
inas       -0.77
ural       -0.76
idental    -0.75
OUT        -0.74
heed       -0.74
abeth      -0.74
okin       -0.72
absor      -0.71
roup       -0.71
Positive Logits
Token      Logit
favorites  0.92
haunt      0.89
favorite   0.84
darling    0.83
underdog   0.76
haun       0.74
amongst    0.73
é¾įå¥ij士  0.70
é¾įå       0.70
whipping   0.69
Activations Density 0.016%
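
The page does not define the density figure. One common reading, assumed here, is the share of token positions on which the feature's activation is above zero. The sketch below illustrates that reading only; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the fraction of positions where the
    feature fires at all; the source page does not state its definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 16 active positions out of 100,000 tokens.
sample = [0.0] * 99_984 + [1.5] * 16
print(f"Activations Density {activation_density(sample):.3f}%")
```

Under this reading, a density of 0.016% would mean the feature fires on roughly 16 of every 100,000 tokens, which is consistent with a narrow feature tied to a small family of words.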