INDEX
Explanations
verbal expressions indicating disapproval
expressions of approval or disapproval
Negative Logits
aunder    -0.94
eworks    -0.77
ocument   -0.70
ixtape    -0.68
hern      -0.66
phabet    -0.62
Clear     -0.62
rez       -0.62
ixt       -0.62
diarr     -0.60
Positive Logits
76561         0.98
passionately  0.82
uncond        0.73
whatsoever    0.73
ļé            0.71
homosexuals   0.71
Īè            0.69
freedom       0.69
seeing        0.68
unres         0.67
Activations Density: 0.334%