Explanations
mentions of personal preferences or favorites
expressions of personal preferences or favorites
Negative Logits
heed       -0.90
aping      -0.84
thur       -0.79
ijk        -0.79
asse       -0.78
ural       -0.78
aton       -0.77
yrinth     -0.76
amping     -0.76
enthusi    -0.76

Positive Logits
favorite    1.12
Favorite    1.01
Favorite    0.96
favorites   0.92
favourite   0.90
龍契士      0.86
darling     0.85
favorite    0.85
="#         0.76
watering    0.75
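Logit lists like the two above are commonly produced by projecting a feature's decoder direction onto the model's unembedding matrix and ranking the per-token scores. The sketch below assumes exactly that procedure; the helper names `logit_scores` and `top_and_bottom` and the toy two-dimensional vectors are hypothetical and are not taken from this dashboard's code.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def logit_scores(decoder_dir: Vector, unembed: Dict[str, Vector]) -> Dict[str, float]:
    """Hypothetical helper: score each token by the dot product of the
    feature's decoder direction with that token's unembedding vector."""
    return {
        token: sum(d * u for d, u in zip(decoder_dir, column))
        for token, column in unembed.items()
    }


def top_and_bottom(
    scores: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k highest-scoring and k lowest-scoring tokens."""
    ranked = sorted(scores.items(), key=lambda item: item[1])
    negative = ranked[:k]                  # most negative first
    positive = list(reversed(ranked))[:k]  # most positive first
    return positive, negative


# Made-up toy data: a 2-d residual stream and a four-token vocabulary.
decoder_dir = [1.0, 0.5]
unembed = {
    "favorite": [1.0, 0.2],
    "darling": [0.6, 0.4],
    "heed": [-0.8, -0.2],
    "aping": [-0.5, -0.6],
}
positive, negative = top_and_bottom(logit_scores(decoder_dir, unembed), k=2)
print("positive:", [(t, round(s, 2)) for t, s in positive])
print("negative:", [(t, round(s, 2)) for t, s in negative])
```

Real dashboards may also fold in layer-norm scaling or normalize the decoder vector first, so absolute values need not match a raw dot product.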
Activation Density: 0.013%
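Activation density is usually the fraction of token positions on which the feature fires. The sketch below assumes "fires" means an activation strictly above zero (the `threshold` parameter is a hypothetical knob; a given dashboard may use a different cutoff) and uses a made-up sample where the feature fires on 13 of 100,000 tokens.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Made-up sample: the feature fires on 13 of 100,000 token positions.
acts = [0.0] * 99_987 + [2.5] * 13
density = activation_density(acts)
print(f"{density:.3%}")  # formats the fraction as a percentage
```

At 0.013%, a feature fires on roughly 1 in 7,700 tokens (100 / 0.013 ≈ 7,692), which makes it a very sparse feature.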