INDEX
Explanations
references to governmental policies and political discussions
Negative Logits
isode       -0.69
regist      -0.67
board       -0.66
sled        -0.64
Manhattan   -0.63
scatter     -0.62
decomp      -0.62
charm       -0.61
mans        -0.60
scene       -0.60
Positive Logits
¬    1.11
Ĵ    1.05
¡    1.04
¤    1.00
ij    0.97
Ļ    0.96
ı    0.95
ľ    0.93
_.   0.93
Ī    0.91
Activations Density: 0.479%