Explanations
references to individuals and their actions, particularly focusing on pronouns and related verbs
Negative Logits
lato      -0.47
riction   -0.47
èvement   -0.46
AIRE      -0.46
ària      -0.45
ourites   -0.45
♀️        -0.45
yf        -0.44
oupe      -0.43
ariato    -0.43
Positive Logits
theirs    1.06
Their     0.85
his       0.83
they      0.83
His       0.81
Their     0.81
hers      0.80
They      0.80
THEY      0.79
They      0.78
Activations Density 0.373%