INDEX
Explanations
phrases indicating willingness or acceptance
positive emotional states or expressions of happiness
NEGATIVE LOGITS
Token            Logit
disproportion    -0.73
GOODMAN          -0.68
oons             -0.67
Posts            -0.66
senal            -0.64
hur              -0.64
Relief           -0.64
arnaev           -0.63
ulz              -0.61
anan             -0.61
POSITIVE LOGITS
Token            Logit
embraced         0.88
awaiting         0.81
accepted         0.80
awaited          0.78
complied         0.78
transitioned     0.77
supplied         0.76
acknowledged     0.76
parted           0.75
welcomed         0.75
Activations Density 0.097%
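
The density figure above can be read as the share of sampled token positions on which the feature fires. The sketch below illustrates that reading under stated assumptions: the function name `activation_density`, the zero threshold, and the sample of 100,000 tokens are all hypothetical and are not taken from the dashboard, which may compute the figure differently.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions with a non-zero
    activation; the dashboard's exact definition is not stated in the source.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 97 active positions out of 100,000 tokens.
sample = [1.5] * 97 + [0.0] * 99_903
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.097%
```

On this reading, a density of 0.097% corresponds to roughly one active position per thousand tokens, so the feature is sparse.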