INDEX
Explanations
concepts related to moral dilemmas and ethical reasoning
Negative Logits
OrFail      -0.15
/browse     -0.15
iel         -0.15
663         -0.14
irl         -0.14
IEL         -0.14
เ           -0.14
Lifetime    -0.14
/dd         -0.14
imum        -0.14
Positive Logits
authority   0.22
reality     0.22
Reality     0.21
Authority   0.20
Truth       0.19
truth       0.19
wrong       0.19
propri      0.17
Wrong       0.17
WRONG       0.17
Activations Density 0.098%