INDEX
Explanations
pronouns and demonstrative articles
Negative Logits
Houſe     -0.96
wiſe      -0.90
Савезне   -0.88
houſe     -0.88
Controllo -0.87
Anſ       -0.84
―――――     -0.84
myſelf    -0.83
pleaſure  -0.82
Reſ       -0.81
Positive Logits
he    0.96
He    0.95
It    0.93
it    0.87
They  0.85
he    0.84
Det   0.83
Det   0.80
they  0.80
He    0.76
Activations Density: 0.038%