INDEX
Explanations
phrases that describe unverified claims or contentious statements
Head Attr Weights
Head  Weight
0     0.07
1     0.01
2     0.07
3     0.26
4     0.02
5     0.09
6     0.03
7     0.08
8     0.03
9     0.05
10    0.16
11    0.08
Negative Logits
Token         Logit
secondary     -1.10
anners        -1.03
hero          -1.03
rals          -1.00
%;            -0.98
nikov         -0.97
practice      -0.97
>.            -0.95
selection     -0.95
anking        -0.94
Positive Logits
Token         Logit
SPONSORED     1.10
incidentally  1.04
anat          1.00
presumably    0.99
Brach         0.96
ought         0.96
translates    0.95
��            0.93
ieri          0.92
lacks         0.92
Activations Density 0.272%