INDEX
Explanations
negative sentiments or actions related to authority and governance
Head Attr Weights
0:0.03
1:0.02
2:0.16
3:0.20
4:0.04
5:0.05
6:0.02
7:0.03
8:0.10
9:0.09
10:0.10
11:0.11
Negative Logits
ritten       -1.34
rification   -1.32
atal         -1.29
pione        -1.22
rament       -1.19
ategor       -1.17
lance        -1.13
bear         -1.12
Leban        -1.09
Polar        -1.09
Positive Logits
"...         1.50
Pastebin     1.38
"…           1.33
himself      1.29
herself      1.29
tone         1.27
gging        1.27
Dialogue     1.25
Instead      1.24
"[           1.22
Activations Density 0.392%