INDEX
Explanations
phrases related to acceptability or lack thereof
terms related to acceptability and unacceptability
Negative Logits
ilant        -0.91
planes       -0.78
craft        -0.76
onso         -0.76
ynthesis     -0.73
oling        -0.72
pelling      -0.71
ocket        -0.71
cest         -0.70
frey         -0.70
Positive Logits
behaviour     0.82
behavior      0.81
srfAttach     0.81
deviations    0.78
Danger        0.77
compromises   0.77
standards     0.74
norms         0.73
defaults      0.71
precedent     0.71
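The logit lists above rank vocabulary tokens by how strongly this feature pushes their output logits up or down. Below is a minimal sketch of how such a ranking can be produced. It assumes the convention, common in SAE dashboards, of taking the dot product of the feature's decoder direction with each column of the unembedding matrix. The two-dimensional residual stream, the four-token vocabulary, the matrix values, and the helper names `logit_effects` and `top_and_bottom` are all hypothetical toy choices, not this feature's real weights.

```python
# Toy sketch: rank vocabulary tokens by the direct logit effect of one SAE feature.
# ASSUMPTION: effect(token j) = decoder_dir . W_U[:, j], a common dashboard convention.
# All numbers below are made up for illustration.

def logit_effects(decoder_dir, unembed, vocab):
    """Return {token: dot(decoder_dir, column j of unembed)} for every vocab token."""
    effects = {}
    for j, token in enumerate(vocab):
        effects[token] = sum(d * unembed[i][j] for i, d in enumerate(decoder_dir))
    return effects

def top_and_bottom(effects, k):
    """Return (k most positive, k most negative) (token, effect) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    return ranked[::-1][:k], ranked[:k]

# Hypothetical 2-dimensional residual stream and 4-token vocabulary.
vocab = ["norms", "standards", "planes", "craft"]
decoder_dir = [1.0, 0.5]
unembed = [                    # W_U laid out as d_model rows x vocab columns
    [0.6, 0.5, -0.4, -0.3],
    [0.2, 0.2, -0.6, -0.2],
]

effects = logit_effects(decoder_dir, unembed, vocab)
positive, negative = top_and_bottom(effects, 2)
print("positive:", [t for t, _ in positive])
print("negative:", [t for t, _ in negative])
```

With these toy values, "norms" and "standards" receive positive effects while "planes" and "craft" receive negative ones, mirroring the shape of the two lists on this page.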
Activations Density 0.051%
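Activation density is the fraction of tokens on which the feature fires. Below is a minimal sketch, assuming that "fires" means an activation strictly above a threshold of zero. The `threshold` parameter and the `activation_density` helper are hypothetical illustrations. At 0.051%, the feature is active on roughly 510 tokens per million, which makes it a sparse feature.

```python
# Sketch: activation density = (# tokens with activation > threshold) / (# tokens).
# ASSUMPTION: "active" means strictly greater than `threshold` (default 0.0).

def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)

# Toy example: one active token out of 2,000 gives a density of 0.05%.
toy = [0.0] * 1999 + [1.3]
print(activation_density(toy))

# Reading the reported figure: 0.051% of tokens.
reported = 0.051 / 100
print(round(reported * 1_000_000))  # approximate active tokens per million
```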