INDEX
Explanations
phrases indicating opposition or resistance to various actions or policies
Head Attr Weights
0:0.02
1:0.05
2:0.09
3:0.04
4:0.02
5:0.06
6:0.05
7:0.05
8:0.14
9:0.07
10:0.08
11:0.27
Negative Logits
lia          -1.23
condensed    -1.18
optimized    -1.14
exqu         -1.12
FINEST       -1.11
staking      -1.10
itone        -1.10
manif        -1.09
insulated    -1.06
poked        -1.06
Positive Logits
anymore      1.39
cause        1.31
altogether   1.17
erous        1.16
Citation     1.16
whatsoever   1.16
comments     1.16
injust       1.15
harms        1.15
hurting      1.13
Activations Density 0.076%