INDEX
Explanations
phrases related to suppression or control of various aspects, including emotions and expressions
Head Attr Weights
0:0.02
1:0.03
2:0.06
3:0.12
4:0.01
5:0.04
6:0.06
7:0.11
8:0.12
9:0.18
10:0.07
11:0.13
Negative Logits
largeDownload:-1.30
urai:-1.21
sonian:-1.17
itz:-1.15
Zup:-1.12
inav:-1.11
liest:-1.11
psc:-1.10
hani:-1.09
yssey:-1.08
Positive Logits
alien:1.29
altogether:1.20
dissent:1.18
urities:1.15
extremism:1.10
whatsoever:1.09
democratically:1.05
falsely:1.05
stigma:1.05
terrorism:1.04
Activations Density 0.013%