INDEX
Explanations
phrases indicating ranking or superiority
Negative Logits
Token        Logit
za           -0.18
inic         -0.16
ansen        -0.16
er           -0.15
top          -0.15
rna          -0.15
ients        -0.15
esses        -0.15
hips         -0.14
ÏĨι          -0.14
Positive Logits
Token        Logit
-notch       0.37
pling        0.34
-rated       0.32
-tier        0.31
most         0.30
notch        0.29
ographical   0.28
ography      0.27
tier         0.26
-flight      0.26
Activations Density: 0.023%