INDEX
Explanations
comparisons emphasizing similarity and equivalence
Negative Logits
letic         -0.16
cow           -0.16
onga          -0.16
illac         -0.16
iller         -0.15
bas           -0.15
ãĥªãĥ¼ãĤº        -0.15
enor          -0.14
ipop          -0.14
à¥įतन         -0.14
Positive Logits
they          0.23
others        0.19
she           0.18
THEY          0.18
we            0.17
he            0.17
manner        0.17
did           0.16
h             0.16
they          0.16
Activations Density: 0.084%