INDEX
Explanations
concepts related to power dynamics and authority
Negative Logits
arend      -0.17
afari      -0.16
eways      -0.15
eward      -0.15
urnal      -0.15
eker       -0.15
zelf       -0.14
-worthy    -0.14
bable      -0.14
owers      -0.14
Positive Logits
fully      0.27
full       0.21
ful        0.20
/power     0.20
735        0.19
lessness   0.18
633        0.17
fu         0.16
lier       0.16
aged       0.15
Activations Density: 0.064%