INDEX
Explanations
questions and assertions about societal values and ethics
Negative Logits
ares        -0.16
omet        -0.14
imulator    -0.14
kus         -0.14
cdf         -0.14
Finder      -0.14
hat         -0.14
Brew        -0.13
Sims        -0.13
bah         -0.13
Positive Logits
serious      0.28
Serious      0.23
worth        0.23
sane         0.23
serious      0.22
civilized    0.21
Worth        0.21
sensible     0.20
anyone       0.20
decent       0.20
Activations Density: 0.190%