Explanations
pronouns and references to personal identity
Negative Logits
åµ        -0.16
inand     -0.15
æ¡Ĥ       -0.15
itag      -0.15
Justin    -0.15
ugin      -0.15
åİ»äºĨ    -0.15
Bowen     -0.15
away      -0.15
Vin
Positive Logits
381          0.16
947          0.16
oise         0.16
Beg          0.16
387          0.16
eon          0.15
heits        0.15
OLDER        0.15
Interceptor  0.14
zsche        0.14
Activations Density: 0.010%