INDEX
Explanations
phrases mentioning specific names or proper nouns, especially related to news reporting or media personalities
mentions of a specific individual, likely a prominent figure in media or politics
Negative Logits
stood         -0.78
historic      -0.69
angers        -0.68
omorphic      -0.68
population    -0.66
ties          -0.66
validation    -0.65
virginity     -0.64
relegation    -0.64
stocking      -0.62
Positive Logits
GOODMAN       2.00
gdala         1.02
APH           0.93
ENN           0.84
AN            0.81
ONES          0.79
ENCY          0.79
NER           0.78
AMY           0.77
aku           0.77
Activations Density: 0.005%