Explanations
words that denote various forms of criticism or negativity towards entities or behaviors
Head Attr Weights
0:0.04
1:0.02
2:0.16
3:0.04
4:0.18
5:0.09
6:0.03
7:0.03
8:0.11
9:0.17
10:0.05
11:0.02
Negative Logits
��        -1.64
��        -1.62
ufact     -1.52
acea      -1.49
QL        -1.46
ドラ      -1.45
Si        -1.40
omnia     -1.34
theless   -1.32
ña        -1.32
Positive Logits
Brom      1.38
Cle       1.36
Lamar     1.32
Hank      1.30
Mer       1.29
Mort      1.21
Klu       1.20
agall     1.19
Clay      1.19
Louis     1.19
Activations Density 0.006%