INDEX
Explanations
words related to dissatisfaction or frustration
references to emotional reactions and social interactions
Negative Logits
GOODMAN        -0.70
itionally      -0.66
roximately     -0.66
ocument        -0.65
certification  -0.64
encia          -0.60
rowth          -0.58
certified      -0.57
aceae          -0.56
hess           -0.56
Positive Logits
decency        0.80
clich          0.77
lest           0.76
coward         0.76
cynicism       0.72
uttered        0.70
nihil          0.70
insults        0.69
inco           0.69
modesty        0.68
Activations Density: 1.756%