Explanations
references to personal favorites or preferences
Negative Logits
er        -0.84
l         -0.70
I         -0.69
</em>     -0.67
In        -0.67
ers       -0.66
r         -0.66
was       -0.65
man       -0.64
n         -0.64
Positive Logits
favorites     1.61
Favorites     1.57
favorite      1.56
Favorite      1.54
favourite     1.52
favourites    1.52
Favourite     1.51
favourite     1.46
FAVORITE      1.44
favorite      1.44
Activations Density 0.042%