INDEX
Explanations
discussions or phrases related to criticism or backlash
Head Attr Weights
0:0.01
1:0.01
2:0.10
3:0.06
4:0.10
5:0.03
6:0.03
7:0.38
8:0.04
9:0.04
10:0.10
11:0.05
Negative Logits
irlf      -1.90
redo      -1.78
arnaev    -1.73
inth      -1.70
Grave     -1.65
raits     -1.64
ixt       -1.64
iece      -1.64
otted     -1.59
eros      -1.58
Positive Logits
endorsing       2.18
antiv           2.14
disapproval     2.11
advis           2.10
endorsement     2.10
censorship      2.08
behavi          2.04
approving       1.97
omission        1.95
endorsements    1.93
Activations Density: 0.000%