INDEX
Explanations
mentions of prestigious titles or awards
proper nouns, particularly names and titles
Negative Logits
enegger     -1.11
schild      -0.83
icago       -0.81
bracelet    -0.73
lihood      -0.71
Wink        -0.69
Beckham     -0.66
ded         -0.65
dred        -0.64
ORGE        -0.63
Positive Logits
ests        1.15
zes         1.07
ety         1.01
esses       0.98
eties       0.89
heed        0.87
quet        0.87
vy          0.86
ific        0.86
ë           0.86
Activations Density 0.009%