Explanations
phrases related to moral judgments and opinions
expressions related to the concept of acceptability or unacceptability
Negative Logits
Token        Logit
craft        -0.82
enfranch     -0.78
dream        -0.77
ilant        -0.76
planes       -0.76
ynthesis     -0.74
ocket        -0.72
wright       -0.72
frey         -0.72
lets         -0.72
Positive Logits
Token        Logit
deviations    0.79
CPC           0.72
ible          0.71
standards     0.71
Danger        0.71
srfAttach     0.70
itable        0.70
compromises   0.69
Gi            0.69
norms         0.69
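Lists like the two above are commonly produced by projecting a feature's output direction through the model's unembedding matrix and ranking every vocabulary token by the resulting score. The sketch below assumes that method; the page does not say how its values were computed, and they may be scaled differently. The function names `logit_effects` and `top_and_bottom` are hypothetical, and the toy matrix and token names are made up, so it does not reproduce the numbers above.

```python
from typing import List, Sequence, Tuple


def logit_effects(direction: Sequence[float],
                  w_u: Sequence[Sequence[float]]) -> List[float]:
    """Score every vocabulary token by projecting a feature direction
    (length d_model) through an unembedding matrix w_u with d_model rows
    and one column per token: score[t] = sum_r direction[r] * w_u[r][t].
    Assumption: this projection is how the logit lists were derived."""
    if len(direction) != len(w_u):
        raise ValueError("direction length must equal the number of rows in w_u")
    vocab_size = len(w_u[0])
    return [sum(direction[r] * w_u[r][t] for r in range(len(direction)))
            for t in range(vocab_size)]


def top_and_bottom(scores: Sequence[float], vocab: Sequence[str],
                   k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k highest-scoring tokens (descending) and the k
    lowest-scoring tokens (ascending), mirroring the two lists above."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    positive = list(reversed(ranked[-k:]))
    negative = ranked[:k]
    return positive, negative


# Toy example with d_model = 2 and a 4-token vocabulary (hypothetical values).
toy_direction = [1.0, 0.5]
toy_w_u = [[0.2, -0.4, 0.9, 0.0],
           [0.2, 0.0, -0.2, -1.0]]
toy_vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
scores = logit_effects(toy_direction, toy_w_u)
pos, neg = top_and_bottom(scores, toy_vocab, k=2)
print(pos)  # highest-scoring tokens first
print(neg)  # lowest-scoring tokens first
```

Sorting once and slicing both ends keeps the positive and negative lists consistent with each other; for a real vocabulary of tens of thousands of tokens a partial selection (for example `heapq.nlargest`) would avoid the full sort.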
Activation density: 0.037%
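Activation density is usually read as the share of tokens on which the feature fires; 0.037% means roughly 1 token in 2,700. The sketch below assumes "fires" means an activation strictly above zero, which the page does not state; the function name `activation_density` and its `threshold` parameter are hypothetical.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold`.
    Assumption: a token counts as active when its activation > threshold."""
    total = 0
    active = 0
    for a in activations:
        total += 1
        if a > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return 100.0 * active / total


# 37 firing tokens out of 100,000 gives a density of 0.037%.
acts = [0.0] * 99_963 + [1.3] * 37
print(f"{activation_density(acts):.3f}%")  # prints 0.037%
```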