INDEX
Explanations
phrases related to ethical and moral reasoning
Negative Logits
ajo         -0.16
adera       -0.15
dar         -0.15
endi        -0.14
冠          -0.14
给我        -0.14
ToBounds    -0.14
dar         -0.14
IALIZED     -0.14
issen       -0.13
Positive Logits
against     0.32
against     0.26
Against     0.26
対          0.25
Against     0.21
fight       0.21
对          0.21
против      0.20
проти       0.20
proti       0.20
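
The page does not say how the logit values above were computed. A common convention for feature dashboards is to project the feature's decoder direction through the model's unembedding matrix and list the most promoted and most suppressed tokens. The sketch below assumes that convention. The function name `feature_logit_effects`, the two-dimensional direction, and the toy matrix are all hypothetical and chosen for illustration only; none of them come from a real model.

```python
def feature_logit_effects(decoder_direction, unembedding, vocab):
    """Project a feature direction onto every vocabulary token.

    decoder_direction: list of d floats (hypothetical feature direction).
    unembedding: d rows, each holding len(vocab) floats (toy unembedding matrix).
    Returns (token, score) pairs sorted from most promoted to most suppressed.
    """
    d = len(decoder_direction)
    scores = [
        sum(decoder_direction[i] * unembedding[i][j] for i in range(d))
        for j in range(len(vocab))
    ]
    return sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)


# Toy values for illustration only; they are not taken from any model.
vocab = ["against", "fight", "the", "ajo"]
direction = [1.0, -0.5]
unembedding = [
    [0.30, 0.20, 0.00, -0.10],
    [-0.04, -0.02, 0.00, 0.12],
]

ranked = feature_logit_effects(direction, unembedding, vocab)
top_positive = ranked[:2]   # tokens the feature pushes the model toward
top_negative = ranked[-1:]  # tokens the feature pushes the model away from
```

Under this reading, a positive value means the feature raises that token's logit when it fires, and a negative value means it lowers it.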
Activations Density 0.396%
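
The density figure is not defined on the page. A minimal sketch, assuming it means the fraction of sampled token positions on which the feature's activation is above zero, is shown below. The function `activation_density`, its `threshold` parameter, and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 1,000 token positions, 4 of which activate the feature.
sample = [0.0] * 996 + [0.7, 1.2, 0.3, 2.5]
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.400%
```

Read this way, a density of 0.396% would mean the feature fires on roughly 4 of every 1,000 sampled tokens.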