INDEX
Explanations
terms related to tolerance policies, especially in the context of governance and behavior regulation
Head Attr Weights
0:0.03
1:0.01
2:0.10
3:0.07
4:0.10
5:0.03
6:0.05
7:0.36
8:0.03
9:0.04
10:0.09
11:0.05
Negative Logits
estamp      -1.92
window      -1.70
prototypes  -1.54
alter       -1.53
hook        -1.51
Ago         -1.51
�           -1.49
ements      -1.48
prints      -1.47
セ          -1.46
Positive Logits
cruelty         1.79
Violence        1.72
harassment      1.71
abuse           1.69
homophobia      1.66
manslaughter    1.66
dealing         1.65
criminally      1.64
racism          1.63
discrimination  1.63
Activations Density 0.000%