Explanations
phrases and terms related to authority and conflict
Negative Logits
osaic         -0.15
boyc          -0.14
RESERVED      -0.14
reserved      -0.14
ulus          -0.14
loub          -0.14
              -0.14
refusal       -0.14
ibold         -0.14
å´            -0.14
Positive Logits
control        0.26
eliminate      0.23
qu             0.23
sil            0.23
stop           0.23
cur            0.22
æĬij           0.22
curb           0.22
-control       0.22
suppression    0.22
Activations Density 0.094%