INDEX
Explanations
phrases indicating political leaning or directionality
Head Attr Weights
0:0.01
1:0.02
2:0.05
3:0.07
4:0.11
5:0.02
6:0.15
7:0.34
8:0.02
9:0.03
10:0.07
11:0.06
Negative Logits
COMPLE     -2.04
lance      -1.70
ーティ     -1.57
pite       -1.53
ogether    -1.52
ruction    -1.50
birth      -1.47
ソ         -1.46
ャ         -1.46
д          -1.43
Positive Logits
shoulders      1.73
plun           1.71
intuition      1.66
shaky          1.64
bandwagon      1.64
volunt         1.63
favorites      1.53
toward         1.51
leans          1.49
directional    1.48
Activations Density 0.003%