Explanations
references to actions and their moral implications
Negative Logits
ritz         -0.18
aepernick    -0.17
descon       -0.15
zburg        -0.14
DES          -0.14
iens         -0.14
asse         -0.14
Ïĥε          -0.14
incinn       -0.14
stral        -0.14
Positive Logits
ĥĿ           0.16
CY           0.15
elli         0.15
noop         0.15
Bullet       0.15
ellig        0.15
809          0.14
immel        0.14
acked        0.14
.bmp         0.14
Activations Density: 0.001%