Explanations
terms related to self-centeredness and self-promotion
Negative Logits
ungs      -0.15
oux       -0.15
KS        -0.15
è¢        -0.14
apa       -0.14
ella      -0.14
vos       -0.14
lus       -0.14
ks        -0.14
intim     -0.14
Positive Logits
self       0.20
/self      0.19
Self       0.17
nish       0.17
same       0.16
-right     0.16
stown      0.16
(self      0.15
congrat    0.15
righteous  0.15
Activations Density: 0.014%