INDEX
Explanations
phrases indicating the presence or inclusion of specific items or features
Negative Logits
же          -0.18
ieres       -0.15
případně    -0.14
arlo        -0.14
･           -0.13
andr        -0.13
acos        -0.13
ع           -0.13
Others      -0.12
oris        -0.12
Positive Logits
both        0.28
neither     0.24
mostly      0.23
собой       0.23
mainly      0.23
nothing     0.23
elements    0.22
only        0.22
everything  0.22
plenty      0.21
Activations Density 0.291%