INDEX
Explanations
words and phrases related to power dynamics and authority
Negative Logits
sis       -0.20
_power    -0.16
iban      -0.16
poder     -0.15
nore      -0.15
pouvoir   -0.15
POWER     -0.15
津        -0.15
arget     -0.15
ureka     -0.14
Positive Logits
fully     0.42
houses    0.32
full      0.29
ful       0.28
bro       0.24
lifting   0.24
lessness  0.24
FUL       0.23
broker    0.23
train     0.22
Activations Density 0.072%