INDEX
Explanations
mentions of personal preferences or favorites
instances of the word "favorite" and its variations
Negative Logits
Token       Logit
aping       -0.86
ural        -0.85
heed        -0.84
attle       -0.79
aton        -0.79
atan        -0.78
urers       -0.77
arin        -0.74
athered     -0.74
sten        -0.74
Positive Logits
Token       Logit
Favorite    1.02
favorite    0.87
pokemon     0.79
whipping    0.76
Favorite    0.75
moments     0.73
darling     0.73
hobbies     0.72
favorites   0.72
sibling     0.71
Activations Density: 0.021%
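
As a rough illustration of how a density figure like this can be read, the sketch below treats activation density as the fraction of token positions on which the feature's activation is above zero. That definition is an assumption about what the dashboard reports, not something this page states. The function name `activation_density`, the `threshold` parameter, and the toy counts are all hypothetical; the toy data is chosen only so that the output lands on the same percentage, and it does not reflect the real token counts behind this feature.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature fires at all; the dashboard's exact definition may differ.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 100,000 token positions, 21 of which activate.
toy_activations = [0.0] * 99_979 + [1.0] * 21
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.021% would mean the feature fires on roughly 2 of every 10,000 tokens, which fits a feature tied to a narrow set of words such as "favorite".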