Explanations
phrases discussing political questions and controversies
Head Attribution Weights
0: 0.06
1: 0.02
2: 0.06
3: 0.06
4: 0.03
5: 0.20
6: 0.09
7: 0.07
8: 0.05
9: 0.06
10: 0.16
11: 0.07
Negative Logits
halftime  -1.13
ensions  -1.03
umn  -0.96
venants  -0.94
ukong  -0.91
Railroad  -0.88
Quan  -0.88
Sep  -0.88
Celest  -0.86
Verse  -0.85
Positive Logits
!).  1.57
)."  1.57
?).  1.56
)</  1.54
).[  1.42
).  1.37
)}  1.36
?)  1.36
!)  1.35
\)  1.33
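Dashboards like this one commonly build the positive and negative logit lists by projecting the feature's output direction onto the model's unembedding matrix and keeping the highest- and lowest-scoring tokens. The sketch below assumes that convention; `feature_dir`, `W_U`, the helper `top_logits`, and the toy vocabulary and random weights are all hypothetical stand-ins, not the dashboard's actual code or data.

```python
import numpy as np

def top_logits(feature_dir, W_U, vocab, k=10):
    """Return the k most negative and k most positive (token, score) pairs.

    feature_dir: shape (d_model,), the feature's output direction (assumed).
    W_U: shape (d_model, d_vocab), the unembedding matrix (assumed).
    vocab: list of token strings, length d_vocab.
    """
    # Direct effect of the feature direction on every token's logit.
    scores = feature_dir @ W_U
    order = np.argsort(scores)  # ascending: most negative first
    neg = [(vocab[i], float(scores[i])) for i in order[:k]]
    pos = [(vocab[i], float(scores[i])) for i in order[::-1][:k]]
    return neg, pos

# Toy example with a hypothetical 8-dim model and 4-token vocabulary.
rng = np.random.default_rng(0)
vocab = ["halftime", "Railroad", "!).", ")."]
W_U = rng.normal(size=(8, len(vocab)))
feature_dir = rng.normal(size=8)
neg, pos = top_logits(feature_dir, W_U, vocab, k=2)
print("negative:", neg)
print("positive:", pos)
```

Read this way, a large positive score such as 1.57 for `!).` would mean the feature directly pushes the model toward predicting that token.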
Activation Density: 0.359%
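Activation density is usually reported as the share of token positions, over some reference dataset, on which the feature fires. A density of 0.359% would then mean roughly 3 to 4 firing tokens per 1,000. The sketch below assumes "fires" means an activation strictly above zero; the dashboard's actual threshold and dataset are not stated here, and the helper `activation_density` and its inputs are hypothetical.

```python
import numpy as np

def activation_density(acts, threshold=0.0):
    """Return the percentage of token positions where the feature fires.

    A position counts as firing when its activation is strictly above
    `threshold` (assumed to be 0 here).
    """
    acts = np.asarray(acts, dtype=float)
    return 100.0 * np.count_nonzero(acts > threshold) / acts.size

# Hypothetical activations over 8 token positions; 2 are non-zero.
acts = [0.0, 0.0, 3.1, 0.0, 0.0, 0.7, 0.0, 0.0]
print(activation_density(acts))  # 25.0
```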