Explanations
mentions of the model's name/brand (the token identifying the model).
Negative Logits
Token   Logit
as      1.48
are     1.09
ва      1.04
have    1.02
ores    1.00
จะ      0.96
with    0.92
on      0.91
have    0.91
க்      0.90
Positive Logits
Token   Logit
L       1.37
H       1.25
F       1.17
M       1.12
B       1.08
Emily   1.05
S       1.05
Sarah   1.02
in      1.00
K       1.00
Activations Density: 0.028%
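
As a rough illustration of what a density figure like this expresses, the minimal sketch below assumes that activation density means the share of sampled tokens on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the toy data are all assumptions made for illustration; none of them come from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    # Count tokens on which the feature is active.
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy data (hypothetical): 10,000 tokens, 3 of which activate the feature.
toy = [0.0] * 9997 + [0.4, 1.2, 2.5]
print(f"{activation_density(toy):.3%}")
```

Under this reading, a density of 0.028% would correspond to roughly 28 active tokens per 100,000 sampled.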