INDEX
Explanations
mentions of specific names or entities in news articles
proper nouns, specifically names of individuals and entities
Negative Logits
otine       -0.84
izations    -0.84
illian      -0.79
ivals       -0.79
ians        -0.76
urgy        -0.76
icked       -0.75
ais         -0.73
ous         -0.73
Pengu       -0.72
Positive Logits
Dee         1.00
pling       0.92
zie         0.89
bris        0.86
gradation   0.83
lde         0.82
ples        0.80
velop       0.78
plin        0.78
ble         0.77
Activations Density: 0.043%