INDEX
Explanations
themes related to cultural superiority and self-righteousness
Negative Logits
aktu         -0.17
allen        -0.16
_verbose     -0.15
ebo          -0.15
↵↵           -0.15
ì£           -0.15
ArrayOf      -0.15
CLR          -0.14
_FT          -0.14
ulur         -0.14
Positive Logits
self          0.27
ego           0.26
superiority   0.26
arrog         0.25
pride         0.25
superior      0.24
confidence    0.24
hub           0.24
arrogance     0.24
eg            0.23
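The two lists above are the ten tokens with the most negative and most positive scores. The following is a minimal sketch of that ranking step, assuming each token already has a single score (how those scores are computed is not stated here). The helper name `top_bottom_k` is hypothetical, and the sample uses a few of the values listed above.

```python
def top_bottom_k(scores, k):
    """Return the k lowest-scoring and k highest-scoring (token, score) pairs.

    Hypothetical helper: `scores` maps token strings to a per-token score.
    """
    # Sort ascending by score; the head gives the most negative entries.
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    bottom = ranked[:k]
    # Reverse so the most positive entries come first.
    top = ranked[::-1][:k]
    return bottom, top


# Sample drawn from the lists above.
sample = {"self": 0.27, "ego": 0.26, "hub": 0.24,
          "aktu": -0.17, "allen": -0.16, "CLR": -0.14}
bottom, top = top_bottom_k(sample, 2)
print(bottom)  # → [('aktu', -0.17), ('allen', -0.16)]
print(top)     # → [('self', 0.27), ('ego', 0.26)]
```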
Activation Density: 0.167%
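A density figure like 0.167% can be read as the share of token positions on which the feature fires. The following is a minimal sketch under that assumption, counting activations strictly above a threshold of 0.0; the dashboard's exact rule is not stated. The function name and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of token positions where the activation exceeds `threshold`.

    Assumption: "active" means strictly greater than `threshold` (0.0 default).
    """
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 1 of 600 tokens.
acts = [0.0] * 599 + [3.2]
print(f"{activation_density(acts):.3%}")  # → 0.167%
```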