INDEX
Explanations
references to moral responsibility and ethical considerations
Negative Logits
à¸ĩหมà¸Ķ  -0.15
огÑĢа  -0.14
?,?,?,?,  -0.14
ndon  -0.13
ãİ  -0.13
اگ  -0.13
ãģłãģ£ãģ¦  -0.12
имÑĥ  -0.12
unta  -0.12
ghest  -0.12
Positive Logits
both  1.37
both  1.26
Both  1.20
BOTH  1.17
Both  1.16
beide  0.98
_both  0.98
ambos  0.95
обо  0.82
obou  0.81
Activations Density: 1.958%