Explanations
phrases indicating that something is morally incorrect or unacceptable
phrases indicating the absence of wrongdoing
Negative Logits
incinn      -0.82
cit         -0.81
soType      -0.77
xit         -0.70
earchers    -0.69
weeney      -0.68
ulum        -0.66
wit         -0.66
cum         -0.65
Ri          -0.64
Positive Logits
headed      0.75
mouth       0.74
wrong       0.70
wing        0.69
eous        0.68
havoc       0.67
behaviour   0.65
nered       0.63
flank       0.63
doing       0.62
Activations Density: 0.011%