INDEX
Explanations
references to claims or allegations regarding deception or misinformation
Head Attr Weights (head:weight)
0:0.16
1:0.03
2:0.09
3:0.04
4:0.04
5:0.05
6:0.15
7:0.04
8:0.06
9:0.23
10:0.03
11:0.03
Negative Logits
acco      -3.53
Hik       -3.31
zech      -3.29
crim      -3.24
coh       -3.23
HK        -3.22
Ic        -3.18
Cooper    -3.17
ongh      -3.17
iolet     -3.17
Positive Logits
Rend       8.89
Render     8.22
Render     8.00
render     7.47
render     7.36
rendering  7.25
rendered   6.53
rend       6.38
rendered   6.10
renders    6.07
Activations Density: 0.001%