Explanations
explanations or discussions about moral and ethical dilemmas
Negative Logits
งหมด
-0.15
огра
-0.15
ãİ
-0.13
だって
-0.13
riad
-0.13
?,?,?,?,
-0.13
اگ
-0.13
иму
-0.12
aeda
-0.12
ghest
-0.12
Positive Logits
both
1.43
both
1.32
Both
1.24
BOTH
1.23
Both
1.20
_both
1.02
beide
0.99
ambos
0.98
neither
0.84
obou
0.84
Activations Density 1.899%