INDEX
Explanations
references to the pronoun "them."
Negative Logits
RTX          -0.80
mire         -0.75
Limit        -0.65
Charg        -0.64
Farn         -0.63
Press        -0.63
083          -0.61
Pause        -0.60
politics     -0.60
Fulton       -0.60
Positive Logits
atic          1.10
atically      1.01
selves        0.91
perished      0.84
selves        0.84
were          0.82
alian         0.81
individually  0.80
are           0.80
originated    0.79
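The dashboard does not say how these logit scores were produced. Dashboards of this kind commonly multiply the feature's decoder direction by the model's unembedding matrix and then list the lowest and highest scores. Below is a minimal pure-Python sketch under that assumption. The function names `logit_effects` and `top_and_bottom`, and every number in the toy example, are hypothetical.

```python
# Hypothetical sketch of a feature's "logit effect".
# ASSUMPTION: effect[v] = sum_i decoder_dir[i] * unembed[i][v], i.e. the
# feature's decoder direction projected through the unembedding matrix
# (d_model rows x vocab columns). Not taken from the dashboard itself.

def logit_effects(decoder_dir, unembed):
    """Return one score per vocabulary token: decoder_dir @ unembed."""
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][v] for i in range(len(decoder_dir)))
        for v in range(vocab_size)
    ]


def top_and_bottom(effects, vocab, k):
    """Return (k lowest, k highest) (token, score) pairs."""
    pairs = sorted(zip(vocab, effects), key=lambda p: p[1])  # ascending
    negative = pairs[:k]          # most negative scores first
    positive = pairs[::-1][:k]    # most positive scores first
    return negative, positive


# Toy example: d_model = 2, four-token vocabulary, made-up weights.
vocab = ["them", "selves", "RTX", "the"]
decoder_dir = [1.0, 0.5]
unembed = [
    [0.2, 0.8, -0.6, 0.0],   # residual dimension 0
    [0.4, 0.4, -0.4, -0.2],  # residual dimension 1
]
effects = logit_effects(decoder_dir, unembed)
negative, positive = top_and_bottom(effects, vocab, k=2)
print("negative:", [t for t, _ in negative])
print("positive:", [t for t, _ in positive])
```

On a real model the two lists would hold the ten lowest and ten highest of these scores over the full vocabulary. The tables above have that shape.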
Activation density: 0.023%
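The dashboard does not define density. Suppose it means the share of tokens on which the feature's activation is above zero. Then 0.023% works out to about 23 firing tokens per 100,000. Below is a small sketch of that arithmetic. The function name `activation_density` and the toy activations are hypothetical.

```python
# Hypothetical sketch: activation density as the percentage of tokens on
# which a feature's activation is strictly positive (an assumed definition).

def activation_density(activations):
    """Return the percentage of entries in `activations` that are > 0."""
    if not activations:
        return 0.0
    firing = sum(1 for a in activations if a > 0)
    return 100.0 * firing / len(activations)


# Toy data: 23 firing tokens out of 100,000.
acts = [0.0] * 100_000
for i in range(23):
    acts[i * 1000] = 1.5  # arbitrary positive activation value
print(f"{activation_density(acts):.3f}%")  # prints "0.023%"
```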