INDEX
Explanations
words related to negative events or situations
terms related to feelings of shame or social failure
Negative Logits
corrid    -0.80
ramer     -0.78
bors      -0.77
eways     -0.76
estone    -0.75
cium      -0.72
rame      -0.69
Aires     -0.69
livest    -0.68
nda       -0.68
Positive Logits
embarrassment  1.16
certs          0.80
ously          0.75
è£ħ            0.75
èª             0.74
dishon         0.73
ãĥĭ            0.73
é¾įå¥ij士       0.73
UAL            0.72
lessly         0.72
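Several of the non-ASCII entries above look like raw byte-level BPE token strings rather than corrupted text. The sketch below shows one way such strings could be mapped back to readable characters, assuming the tokenizer uses the GPT-2 style byte-to-character table; the source does not say which tokenizer produced these tokens, so that is an assumption, and `decode_token` is a hypothetical helper name. Characters that are not part of the table are passed through unchanged, and incomplete UTF-8 sequences (a token can end in the middle of a character) come out as the replacement character.

```python
def bytes_to_unicode():
    """Byte -> printable character table of GPT-2 style byte-level BPE (assumed here)."""
    printable = (
        list(range(ord("!"), ord("~") + 1))  # 0x21..0x7E
        + list(range(0xA1, 0xAD))            # 0xA1..0xAC
        + list(range(0xAE, 0x100))           # 0xAE..0xFF
    )
    table = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes are remapped to code points from U+0100 upward.
            table[b] = chr(256 + shift)
            shift += 1
    return table


UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Hypothetical helper: turn a byte-level token string back into text."""
    out = []
    pending = bytearray()
    for ch in token:
        if ch in UNICODE_TO_BYTE:
            pending.append(UNICODE_TO_BYTE[ch])
        else:
            # Character is outside the byte table: flush what we have and keep it as-is.
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()
            out.append(ch)
    out.append(pending.decode("utf-8", errors="replace"))
    return "".join(out)
```

Plain ASCII fragments such as `dishon` come back unchanged, so the helper can be applied to a whole logit table without special-casing the readable rows.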
Activations Density 0.015%
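For readers unfamiliar with the density figure, here is a minimal sketch of how such a number could be computed, assuming it means the fraction of sampled tokens on which the feature has a nonzero activation. The source does not define the metric, so this definition, the function name `activation_density`, and the sample data are all illustrative assumptions.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Made-up sample: the feature fires on 3 of 20,000 tokens.
sample = [0.0] * 19997 + [1.2, 0.4, 2.0]
density = activation_density(sample)
density_label = f"{density:.3%}"
```

Under this reading, a density of 0.015% would correspond to the feature firing on roughly 15 of every 100,000 tokens.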