INDEX
Explanations
negative descriptors or adjectives associated with moral judgment
wickedness and sin
Negative Logits
Token           Logit
exitRule        -0.72
IFA             -0.65
arşivlendi      -0.63
EEU             -0.62
IFA             -0.62
Seitz           -0.59
igy             -0.57
nasium          -0.56
nup             -0.56
Wikimedijinoj   -0.56
Positive Logits
Token           Logit
Wicked          1.87
Wicked          1.70
wicked          1.69
wicked          1.68
wickedness      0.96
恶              0.48
惡              0.44
vicious         0.43
Wild            0.43
nonlinear       0.43
Activations Density 0.001%