INDEX
Explanations
statements related to moral and ethical principles
Negative Logits
лаÑĪ        -0.16
Worldwide  -0.15
alars      -0.15
ÅĤu        -0.15
altar      -0.15
serter     -0.14
å½         -0.14
ervers     -0.14
ulong      -0.14
zion       -0.14
Positive Logits
Nack   0.17
nor    0.16
itis   0.15
JOR    0.15
oader  0.15
åĽ£    0.14
489    0.14
Nor    0.14
SOCK   0.14
ronic  0.13
Activations Density: 0.248%