INDEX
Explanations
references to moral and ethical concepts
Negative Logits
elong      -0.18
getMethod  -0.16
elan       -0.15
thinkable  -0.15
elles      -0.15
esis       -0.14
اوری       -0.14
el         -0.14
enheim     -0.14
атив       -0.14
Positive Logits
izing      0.21
fiber      0.19
Mor        0.19
fibre      0.18
lez        0.18
Mor        0.18
Moral      0.17
izin       0.17
ized       0.17
istic      0.17
Activations Density 0.015%