Explanations
phrases related to morality and ethics
Negative Logits
Token       Logit
eyer        -0.17
VERRIDE     -0.16
odon        -0.15
OD          -0.14
zon         -0.14
Dove        -0.14
adal        -0.14
Accept      -0.14
.respond    -0.14
Clear       -0.14
Positive Logits
Token       Logit
statement   0.18
correct     0.17
statements  0.17
イズ        0.15
correct     0.15
guts        0.15
pler        0.14
дж          0.14
ntl         0.14
说的        0.14
Activations Density: 0.349%