INDEX
Explanations
phrases related to moral or ethical judgments
Negative Logits
ilant      -0.88
craft      -0.80
lets       -0.77
avers      -0.76
oling      -0.75
ocket      -0.74
cest       -0.73
planes     -0.72
yss        -0.71
frey       -0.71
Positive Logits
deviations   0.86
behaviour    0.80
Danger       0.78
behavior     0.77
ible         0.75
compromises  0.74
standards    0.74
norms        0.73
srfAttach    0.72
acceptable   0.71
Activations Density: 0.047%
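
The page does not say how the density figure is computed. A minimal sketch of one common reading, assuming it is the fraction of token positions on which the feature's activation is above zero: the function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and are not taken from the source.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions on which the
    feature fires at all. The source page does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (made up): 10,000 token positions, 5 of which activate the feature.
toy = [0.0] * 9995 + [1.2, 0.4, 3.1, 0.9, 2.2]
density = activation_density(toy)
print(f"{density:.3%}")  # prints 0.050%
```

Under this reading, a density of 0.047% would correspond to the feature firing on roughly 47 of every 100,000 token positions.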