INDEX
Explanations
instances of personal pronouns and identity-related language
Negative Logits
mel         -0.16
ander       -0.15
anie        -0.14
McInt       -0.14
vir         -0.14
Merrill     -0.14
ida         -0.14
occo        -0.13
bour        -0.13
synchron    -0.13
Positive Logits
eventually   0.28
Eventually   0.26
eventual     0.24
Eventually   0.23
Initially    0.20
initially    0.19
gradually    0.19
Initially    0.19
sooner       0.17
寻           0.17
Activations Density 0.005%