INDEX
Explanations
mentions of favorites or preferences in various contexts
Negative Logits
Token   Logit
er      -0.79
I       -0.69
n       -0.66
ers     -0.65
In      -0.65
l       -0.64
(       -0.63
N       -0.63
ra      -0.62
in      -0.61
Positive Logits
Token        Logit
favorites    1.51
Favorites    1.50
favorite     1.45
Favorite     1.43
favourite    1.39
favourites   1.35
Favourite    1.35
favorites    1.35
favorite     1.35
FAVORITE     1.34
Activations Density 0.039%