Explanations
references to specific individuals, particularly political figures
Head Attribution Weights
Head  Weight
0     0.06
1     0.12
2     0.08
3     0.09
4     0.08
5     0.08
6     0.08
7     0.06
8     0.07
9     0.09
10    0.07
11    0.07
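The per-head weights can be illustrated with a minimal sketch. Several assumptions are not stated on this page. The latent is assumed to come from an SAE trained on concatenated attention-head outputs, so its decoder vector splits into 12 equal per-head slices. Each head's weight is assumed to be that slice's L2 norm divided by the sum of all slice norms. The function name `head_attribution` and the toy vector are hypothetical.

```python
import math

def head_attribution(decoder_vec, n_heads):
    """Hypothetical: share of a decoder vector's norm carried by each head's slice."""
    if len(decoder_vec) % n_heads != 0:
        raise ValueError("decoder length must be divisible by n_heads")
    d_head = len(decoder_vec) // n_heads
    # Assumed layout: head-major concatenation, so head h owns one contiguous slice
    norms = [
        math.sqrt(sum(x * x for x in decoder_vec[h * d_head:(h + 1) * d_head]))
        for h in range(n_heads)
    ]
    total = sum(norms)
    # Normalize so the per-head weights sum to 1
    return [n / total for n in norms]

# Toy example: 3 heads with d_head = 2
weights = head_attribution([3.0, 4.0, 0.0, 0.0, 0.0, 5.0], n_heads=3)
print(weights)
```

Under this assumed normalization, the true weights sum to 1. The twelve values above are rounded to two decimals, so their sum of 0.95 is within rounding error of that.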
Negative Logits
Token       Logit
wra         -1.77
Safari      -1.76
vine        -1.73
�           -1.73
osen        -1.69
atal        -1.66
osity       -1.66
ás          -1.64
×           -1.63
selection   -1.63
Positive Logits
Token           Logit
wiser           2.19
seless          2.07
dwelling        1.87
mentally        1.80
unaccount       1.73
OY              1.73
interpreting    1.71
mourning        1.71
conscientious   1.68
olean           1.66
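The two logit lists can be read with the following sketch, which rests on an assumption. Dashboards like this one commonly score every vocabulary token by the dot product of the latent's output direction with that token's unembedding vector. They then show the highest and lowest scores. The sketch takes a direction already in residual-stream space as given. For an attention-output latent, the decoder vector would first need mapping through the output projection, which is not shown. The names `logit_effects` and `tok_a` through `tok_d` and the toy vectors are hypothetical.

```python
def logit_effects(direction, unembed, k):
    """Rank tokens by the dot product of a residual-space direction with each unembedding vector.

    direction: list[float] of length d_model (assumed already in residual-stream space)
    unembed:   dict mapping token string -> list[float] of length d_model
    Returns (positive, negative): the k highest and k lowest (token, score) pairs.
    """
    scores = {
        tok: sum(d * u for d, u in zip(direction, vec))
        for tok, vec in unembed.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1])
    negative = ranked[:k]        # most negative score first
    positive = ranked[::-1][:k]  # most positive score first
    return positive, negative

# Toy 2-dimensional example with made-up tokens and vectors
unembed = {
    "tok_a": [1.0, 0.5],
    "tok_b": [0.8, 0.2],
    "tok_c": [-1.0, 0.0],
    "tok_d": [-0.5, -0.5],
}
pos, neg = logit_effects([2.0, 1.0], unembed, k=2)
print(pos)
print(neg)
```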
Activation Density: 0.000%
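A density of 0.000% does not by itself mean the latent never fires. The sketch below assumes density is the percentage of sampled tokens on which the latent's activation exceeds zero, displayed to three decimal places. Neither the definition nor the sample size is stated on this page. The function name `activation_density` and the toy data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# Toy data: the latent fires on one token out of a million
acts = [0.0] * 999_999 + [3.2]
density = activation_density(acts)
print(f"{density:.3f}%")  # prints 0.000%
```

Under these assumptions, a latent that fires once per million tokens still displays as 0.000%. The reading is therefore consistent with a very sparse latent, not only a dead one.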