INDEX
Explanations
non-English characters that are part of some kind of pattern or sequence
special characters or symbols in the text
Negative Logits
Token        Logit
Flavoring    -0.98
nings        -0.97
awaru        -0.94
contrace     -0.85
merce        -0.83
mathemat     -0.75
kef          -0.73
thodox       -0.73
holders      -0.71
ifice        -0.71
Positive Logits
Token                       Logit
ãĤ¡                         1.05
ople                        0.90
ر                           0.82
urn                         0.80
ι                           0.79
âĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢâĶĢ    0.75
а                           0.75
¹                           0.75
ern                         0.75
inx                         0.75
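
Some of the token strings above (for example ãĤ¡ and the repeated âĶĢ) look like the printable stand-ins that GPT-2-style byte-level BPE vocabularies use for raw bytes. The sketch below assumes that is the case here, which the page itself does not state. It rebuilds the byte-to-character table used by such tokenizers and inverts it to recover the underlying text. The helper name `decode_token` is hypothetical, and tokens that contain characters outside the table (such as ر or ι, which appear to be shown already decoded) are returned unchanged.

```python
def bytes_to_unicode():
    """Byte -> printable character table used by GPT-2-style byte-level BPE."""
    # Bytes that already map to a printable character keep their code point.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    # Every remaining byte is shifted to a code point at or above U+0100.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: printable stand-in character -> original byte value.
BYTE_DECODER = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token: str) -> str:
    """Decode a byte-level token string to text (hypothetical helper).

    Assumption: the token comes from a GPT-2-style vocabulary. If any
    character is not in the table, the token is returned as-is.
    """
    if any(ch not in BYTE_DECODER for ch in token):
        return token
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    # Partial UTF-8 sequences become U+FFFD instead of raising an error.
    return raw.decode("utf-8", errors="replace")
```

Under this assumption, ãĤ¡ decodes to the katakana ァ and each âĶĢ decodes to the box-drawing character ─, which would fit the "non-English characters" and "special characters or symbols" explanations given at the top of the page.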
Activations Density 0.009%