Explanations
words related to negative judgments or criticisms
terms related to disgrace and moral failure
Negative Logits
Token        Logit
ellipt       -0.89
envelope     -0.77
Avalon       -0.72
ultrasound   -0.65
apt          -0.63
equilibrium  -0.62
impulse      -0.62
airplane     -0.62
addafi       -0.61
shutter      -0.61
Positive Logits
Token        Logit
ful          1.41
fully        1.15
orial        0.94
ious         0.94
beat         0.89
xual         0.87
yers         0.84
edly         0.83
forth        0.83
es           0.82
Activations Density: 0.045%