INDEX
Explanations
mentions of groups of people using pronouns like 'they' and 'we'
auxiliary verbs and pronouns
Negative Logits
their      -1.17
their      -1.17
Their      -1.16
Their      -1.06
thier      -0.91
kanilang   -0.89
他们的     -0.89
THEIR      -0.88
他們的     -0.86
leur       -0.84
Positive Logits
are         1.16
have        0.85
aren        0.85
were        0.79
don         0.76
want        0.74
know        0.73
introduce   0.73
join        0.71
come        0.70
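
The page does not say how the two logit lists are produced. A common approach for feature dashboards of this kind, assumed here and not confirmed by the source, is to project a feature's output direction through the model's unembedding matrix and list the tokens with the lowest and highest resulting scores. The sketch below illustrates that idea with made-up numbers; `top_logit_tokens`, `direction`, `unembed`, and the toy values are all hypothetical and are not the values shown above.

```python
def top_logit_tokens(direction, unembed, vocab, k=3):
    """Return (most_negative, most_positive) lists of (token, score) pairs.

    direction: feature output direction, length d (hypothetical input)
    unembed:   d rows, each holding one score weight per vocabulary token
    vocab:     token strings, one per unembedding column
    """
    # score[j] = sum over i of direction[i] * unembed[i][j]
    scores = [
        sum(d * row[j] for d, row in zip(direction, unembed))
        for j in range(len(vocab))
    ]
    # Sort ascending so the most suppressed tokens come first.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return negative, positive


# Toy data, invented for illustration only.
vocab = ["their", "are", "have", "leur"]
direction = [1.0, -0.5]
unembed = [
    [-1.0, 1.0, 0.5, -0.5],
    [0.5, -0.5, -1.0, 0.5],
]

negative, positive = top_logit_tokens(direction, unembed, vocab, k=2)
print("negative:", negative)
print("positive:", positive)
```

Under this reading, a feature like the one above would suppress possessive forms of "their" and promote verbs that tend to follow "they" or "we".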
Activations Density 1.892%
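
The density figure is not defined on the page. One plausible reading, assumed here, is the fraction of sampled tokens on which the feature has a nonzero activation. The sketch below computes that fraction; `activation_density`, the `threshold` parameter, and the sample values are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of activations strictly above threshold (assumed definition)."""
    if not activations:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Invented sample: the feature fires on 2 of 100 tokens.
sample = [0.0] * 98 + [0.4, 2.1]
density = activation_density(sample)
print(f"{density:.3%}")
```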