INDEX
Explanations
mentions of famous personalities and political figures along with negative associations
Negative Logits
eem         -0.71
Apart       -0.66
behold      -0.60
Interested  -0.60
Pair        -0.58
icking      -0.57
etting      -0.55
TG          -0.55
hand        -0.55
ove         -0.54
Positive Logits
been        1.70
been        1.44
undergone   1.30
gotten      1.25
become      1.25
begun       1.22
Been        1.15
risen       1.12
gone        1.12
arisen      1.07
Activations Density: 0.803%