Explanations
concepts related to morality and ethical behavior
Negative Logits
zen         -0.16
ej          -0.14
Barr        -0.14
ner         -0.13
usi         -0.13
Platt       -0.13
complied    -0.13
pla         -0.13
476         -0.13
éļ          -0.13
Positive Logits
anke        0.15
isches      0.14
enty        0.14
ARRANT      0.14
airo        0.14
ána         0.13
æĪIJ人      0.13
å¯Ħ         0.13
лаз         0.13
entral      0.13
Activations Density: 0.750%