INDEX
Explanations
words related to misleading narratives and sensationalism in news or discourse
Head Attr Weights
0:0.03
1:0.04
2:0.08
3:0.20
4:0.03
5:0.02
6:0.12
7:0.11
8:0.04
9:0.08
10:0.08
11:0.11
Negative Logits
omething     -1.25
predec       -1.18
ngth         -1.17
Wonders      -1.13
ortium       -1.11
disadvant    -1.10
htaking      -0.97
athed        -0.97
cedented     -0.95
AAF          -0.94
Positive Logits
�            1.10
perty        1.00
argument     0.97
ody          0.96
Isaac        0.94
�            0.94
java         0.90
Ly           0.89
ة            0.89
Chicken      0.88
Activations Density 0.001%