INDEX
Explanations
phrases related to moral or ethical judgments and critical assessments of decision-making
Negative Logits
485     -0.16
ye      -0.16
oha     -0.16
unh     -0.15
zer     -0.15
ode     -0.15
urine   -0.14
lox     -0.14
orie    -0.14
inge    -0.13
POSITIVE LOGITS
when    0.18
khi     0.18
tol     0.17
když    0.17
adian   0.17
Gregg   0.16
aret    0.15
when    0.15
errupt  0.15
çĸ      0.15
Activations Density 0.203%