INDEX
Explanations
phrases related to social or political rejection or disapproval
Head Attr Weights
0:0.08
1:0.03
2:0.19
3:0.07
4:0.12
5:0.08
6:0.03
7:0.02
8:0.14
9:0.10
10:0.05
11:0.03
Negative Logits
Angelo     -1.14
Cec        -1.11
insula     -1.09
Cinem      -1.05
Lund       -1.04
Surviv     -1.04
Romero     -1.03
Omaha      -1.03
Nebula     -1.02
Philipp    -1.02
Positive Logits
ppings        1.25
virtues       1.25
cuts          1.24
bole          1.21
altogether    1.20
charms        1.16
agin          1.12
Downloadha    1.11
ocratic       1.11
roots         1.11
Activations Density 0.002%