INDEX
Explanations
Arguments related to morality and personal responsibility
Negative Logits
Token    Logit
Haven    -0.19
haven    -0.18
Must     -0.18
must     -0.18
must     -0.17
ala      -0.16
.Must    -0.16
Must     -0.16
ulle     -0.15
hasn     -0.14
Positive Logits
Token       Logit
certainly   0.24
sounds      0.21
Sounds      0.19
assumes     0.19
sounds      0.18
Sounds      0.17
strikes     0.17
Certainly   0.17
definitely  0.17
assum       0.17
Activations Density: 0.081%