INDEX
Explanations
rankings or lists of items within different categories
phrases indicating rankings or lists of top items
Negative Logits
norm          -0.84
icum          -0.83
redo          -0.83
roth          -0.81
protection    -0.76
arantine      -0.75
limited       -0.75
amar          -0.74
athered       -0.74
amination     -0.73
Positive Logits
Favorite      1.02
Worst         0.86
Influ         0.85
quotes        0.79
celeb         0.76
Bucket        0.76
unsolved      0.76
Celebrity     0.76
Places        0.75
Songs         0.74
Activations Density 0.305%