INDEX
Explanations
instances of the word "control" in relation to power dynamics
Negative Logits
ibal      -0.16
éĽ        -0.14
ussen     -0.14
ornings   -0.14
allen     -0.14
EDIATE    -0.14
æĮĻ       -0.14
oris      -0.14
arium     -0.14
åĪ·       -0.13
Positive Logits
iser      0.16
Platt     0.15
ufe       0.14
彦        0.14
/browse   0.14
ervo      0.14
lun       0.14
nde       0.13
705       0.13
imb       0.13
Activations Density: 0.034%