INDEX
Explanations
references to social or political commentary
Head Attr Weights
0:0.12
1:0.45
2:0.03
3:0.03
4:0.02
5:0.12
6:0.02
7:0.01
8:0.03
9:0.06
10:0.03
11:0.03
Negative Logits
SE        -1.96
asc       -1.93
�         -1.85
Seg       -1.82
�         -1.81
spons     -1.78
SA        -1.73
Sailor    -1.72
ソ        -1.70
�         -1.60
Positive Logits
theless   2.17
bage      1.82
bish      1.68
but       1.65
etheless  1.60
gered     1.58
warts     1.56
ebin      1.56
ulous     1.53
but       1.50
Activations Density 0.016%