INDEX
Explanations
references to specific individuals or critiques of societal figures
Head Attr Weights
0:0.03
1:0.04
2:0.11
3:0.11
4:0.02
5:0.03
6:0.20
7:0.10
8:0.07
9:0.10
10:0.06
11:0.07
Negative Logits
FORMATION: -1.15
virtues: -1.11
deception: -1.11
��: -1.09
contingent: -1.05
CLASSIFIED: -1.02
ailability: -1.00
suspic: -1.00
reversible: -1.00
ulative: -0.99
Positive Logits
pload: 1.19
iak: 1.18
mith: 1.16
schild: 1.14
opal: 1.12
hend: 1.07
Fraz: 1.05
nik: 1.05
yi: 1.04
adish: 1.02
Activations Density 0.003%