INDEX
Explanations
instances of authority figures and their interactions with subordinates
Negative Logits
主人      -0.17
мом       -0.17
хозя      -0.17
utters    -0.16
klä       -0.15
geber     -0.14
icros     -0.14
лава      -0.14
mdir      -0.14
ála       -0.14
Positive Logits
subordinate   0.28
assistant     0.27
assistants    0.26
his           0.26
followers     0.25
associate     0.25
deputy        0.25
副            0.24
team          0.22
deputies      0.22
Activations Density 0.252%
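
As a rough illustration of how a density figure like the one above can be read, the sketch below computes the fraction of tokens on which a feature is active. It assumes "density" means the share of tokens with activation above a threshold; the function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and not taken from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: density is defined as (active tokens) / (all tokens).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 10,000 tokens, of which 25 activate the feature.
sample = [0.0] * 9975 + [1.3] * 25
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density of 0.252% would mean the feature fires on roughly 1 token in 400.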