Explanations
specific names and proper nouns across different languages
Tokens preceding non-English text
foreign language beginnings
Negative Logits
,        -0.70
↵        -0.70
(        -0.67
:        -0.67
.        -0.66
!        -0.64
1        -0.62
<eos>    -0.62
;        -0.62
(        -0.61
Positive Logits
ſelves        1.16
Мексичка      1.12
NUMX          1.10
ouſly         1.08
ſelf          1.08
ghijklmnop    1.06
geſ           1.04
ERSITY        1.00
leſs          1.00
eſt           0.98
Activations Density 0.060%