INDEX
Explanations
phrases and concepts related to morality and ethical implications
Negative Logits
Token        Logit
637          -0.15
ÙĤÛĮ         -0.15
Wy           -0.15
Wy           -0.14
Burnett      -0.14
-validate    -0.14
awy          -0.13
łĢ           -0.13
-avatar      -0.13
spe          -0.13
Positive Logits
Token          Logit
term           0.22
description    0.20
apt            0.19
word           0.18
-description   0.18
term           0.17
description    0.17
Term           0.17
adjective      0.16
word           0.16
Activations Density: 0.075%