Explanations
Phrases and terms related to justification, particularly of moral or ethical behavior.
Negative Logits
لان           -0.15
alama         -0.15
jac           -0.15
ichel         -0.14
λαν           -0.14
_DEPEND       -0.14
_DEPRECATED   -0.14
endi          -0.14
元            -0.13
iw            -0.13
Positive Logits
why           0.23
justify       0.22
justification 0.20
why           0.18
excuse        0.18
reasons       0.17
justify       0.16
rationale     0.16
ably          0.16
justified     0.16
Activation Density: 0.057%