Explanations
concepts related to moral and ethical dilemmas
Negative Logits
token           logit
intro           -0.19
introduction    -0.17
836             -0.17
akov            -0.15
introducing     -0.15
olas            -0.15
roman           -0.15
é¼ĵ             -0.14
835             -0.14
dit             -0.14
Positive Logits
token           logit
others          0.21
someone         0.20
somebody        0.20
anybody         0.20
otherwise       0.20
unintention     0.20
anyone          0.20
everybody       0.20
everyone        0.19
someone         0.19
Activations Density 0.028%