INDEX
Explanations
mentions of personal preferences or favorite things
mentions of personal favorites
Negative Logits
Token    Logit
aton     -0.77
raz      -0.76
asse     -0.76
aping    -0.74
rene     -0.74
acial    -0.73
thur     -0.73
apers    -0.72
ural     -0.72
rain     -0.70
Positive Logits
Token       Logit
favorites   1.48
favourites  1.32
favorite    1.09
龍契士      0.97
favorite    0.96
Favorite    0.90
favourite   0.90
itism       0.87
fav         0.86
Favor       0.84
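Token lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and ranking the vocabulary by the resulting logit. The page itself does not state how its lists were computed, so the following is only a minimal sketch under that assumption. The names `top_logit_tokens`, `decoder_dir`, and `unembed` are hypothetical, and the matrix values are made-up toy numbers, not the values shown on this page.

```python
import numpy as np

def top_logit_tokens(decoder_dir, unembed, vocab, k=10):
    """Return the k most negative and k most positive (token, logit) pairs.

    decoder_dir: (d_model,) feature output direction (hypothetical).
    unembed:     (d_model, vocab_size) unembedding matrix (hypothetical).
    vocab:       list of vocab_size token strings.
    """
    logits = decoder_dir @ unembed      # one logit contribution per token
    order = np.argsort(logits)          # indices sorted by ascending logit
    negative = [(vocab[i], float(logits[i])) for i in order[:k]]
    positive = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return negative, positive

# Toy data for illustration only: a 2-dimensional model with 4 tokens.
vocab = ["favorites", "fav", "rain", "aton"]
unembed = np.array([[1.5, 0.9, -0.7, -0.8],
                    [0.0, 0.0, 0.0, 0.0]])
decoder_dir = np.array([1.0, 0.0])

negative, positive = top_logit_tokens(decoder_dir, unembed, vocab, k=2)
print("negative:", negative)
print("positive:", positive)
```

In this toy setup the feature pushes probability toward "favorites" and "fav" and away from "aton" and "rain", which mirrors the shape of the lists above without reproducing their numbers.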
Activations Density 0.006%
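For a sense of scale, a density of 0.006% corresponds to roughly 6 active tokens per 100,000. The sketch below assumes that "density" means the fraction of sampled tokens on which the feature's activation is above zero; the page does not define the term, and `activation_density` and the synthetic activations are hypothetical.

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Fraction of token positions whose activation exceeds the threshold."""
    acts = np.asarray(activations, dtype=float)
    return float(np.mean(acts > threshold))

# Synthetic activations for illustration: 6 active positions out of 100,000.
acts = np.zeros(100_000)
acts[:6] = 1.0

density = activation_density(acts)
print(f"{density:.3%}")
```

A feature this sparse fires on very few tokens, which is consistent with a narrow explanation such as mentions of personal favorites.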