Explanations
mentions of politics and corruption
Negative Logits
Cancel       -0.90
actory       -0.84
imates       -0.77
wolves       -0.74
oa           -0.72
amination    -0.71
tered        -0.70
Takeru       -0.70
wered        -0.70
olen         -0.68
Positive Logits
correctness  1.37
eering       0.97
activism     0.90
intrig       0.89
pund         0.88
appoint      0.88
rhetoric     0.87
clout        0.82
affiliation  0.81
affili       0.81
Activations Density 1.441%