INDEX
Explanations
statements pertaining to morality
Negative Logits
Token      Logit
xual       -0.80
gow        -0.78
rams       -0.73
rooms      -0.71
Lup        -0.70
upon       -0.69
abee       -0.68
lers       -0.67
-+         -0.67
minster    -0.67
POSITIVE LOGITS
Token          Logit
istic          1.14
izing          1.06
hazard         1.05
indignation    1.03
compass        1.03
ising          1.02
obligation     1.01
istically      0.97
dile           0.97
conscience     0.96
Activations Density 0.062%