Explanations
phrases indicating comparisons or similarities
Negative Logits
Token      Logit
acco       -0.19
orsi       -0.18
gars       -0.16
IGHL       -0.15
LIKE       -0.15
ει         -0.14
igin       -0.14
пÑĥ        -0.14
them       -0.14
iego       -0.14
Positive Logits
Token      Logit
they       0.29
it         0.24
there      0.23
we         0.23
able       0.20
something  0.19
maybe      0.18
WISE       0.17
a          0.17
she        0.17
Activations Density: 0.026%
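
The page does not define the density figure. A minimal sketch of one common reading follows, assuming it is the fraction of token positions on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the toy data are all hypothetical and do not come from the dashboard.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions where the activation exceeds threshold.

    Assumption: "density" means the share of positions on which the feature fires.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical toy data: 10,000 token positions, 3 of them active.
toy_activations = [0.0] * 9997 + [1.2, 0.4, 2.7]
density = activation_density(toy_activations)
print(f"{density:.3%}")
```

Under this reading, a density of 0.026% would mean the feature fires on roughly 26 of every 100,000 token positions.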