INDEX
Explanations
mentions of political figures and activities related to politics
Negative Logits
Token     Logit
Sidd      -1.12
IELD      -1.02
UGE       -1.01
ADE       -0.99
ORGE      -0.99
enegger   -0.95
Goldberg  -0.94
VEL       -0.89
HEAD      -0.87
Learning  -0.87
Positive Logits
Token     Logit
itives    1.79
itive     1.72
§         1.54
ĭ         1.53
otaur     1.52
eties     1.52
Ľ         1.51
İ         1.48
acy       1.47
į         1.43
Activations Density: 2.188%