INDEX
Explanations
pejorative or mocking references to unflattering office behaviors or personalities
Negative Logits
Vice         -0.53
ору          -0.49
çu           -0.48
Ple          -0.48
Vice         -0.47
oprot        -0.44
iprot        -0.44
inescence    -0.43
\,\          -0.43
stab         -0.42
Positive Logits
boss         3.11
boss         2.66
Boss         2.66
Boss         2.56
bosses       2.38
BOSS         2.09
BOSS         1.77
ボス         1.13
chefe        1.13
jefe         1.10
Activations Density 0.002%