INDEX
Explanations
words that refer to pronouns and their usage
Negative Logits
liš         -0.15
Secondary   -0.15
Pyramid     -0.15
ter         -0.15
neys        -0.15
doubly      -0.14
Pun         -0.14
secondary   -0.14
ori         -0.14
native      -0.14
Positive Logits
pron        0.31
Pron        0.25
pron        0.23
Singular    0.20
singular    0.18
azen        0.17
Us          0.17
demonstr    0.17
Us          0.17
singular    0.17
Activations Density 0.035%