INDEX
Explanations
instances of the word "wrong" and related expressions of moral judgment or ethical considerations
Negative Logits
otto       -0.18
ibo        -0.16
uegos      -0.15
AAF        -0.15
.ly        -0.14
icense     -0.14
lernen     -0.14
jective    -0.14
loid       -0.14
mux        -0.14
Positive Logits
fully      0.21
ulent      0.16
acha       0.16
ti         0.15
tt         0.15
zeitig     0.14
oster      0.14
ysqli      0.14
aken       0.14
omas       0.14
Activations Density: 0.023%