Explanations
characters in non-English alphabets
specific characters or names with diacritics or unusual orthography
Negative Logits
etts         -0.91
enna         -0.88
yrus         -0.85
artisan      -0.84
rils         -0.82
patrick      -0.78
hens         -0.77
ements       -0.75
acity        -0.74
andestine    -0.73
Positive Logits
士           0.97
ufact        0.87
エル         0.87
Ü            0.82
``           0.80
vous         0.77
──           0.76
WithNo       0.75
ccording     0.74
姫           0.71
Activations Density 0.035%