INDEX
Explanations
a mix of characters and possibly specific words or phrases that seem to not have a clear thematic link
non-standard and special characters in the text
Negative Logits
ettings       -0.80
tremend       -0.76
yip           -0.75
eatures       -0.64
Grimes        -0.64
olean         -0.62
ottesville    -0.62
wana          -0.60
hib           -0.60
Wyr           -0.60
Positive Logits
ë       0.78
à¤      0.75
ÑĤ      0.73
ì       0.73
à¸      0.72
ìĿ      0.72
inen    0.71
talk    0.70
ãģ£     0.69
ëĭ      0.68
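Several of the positive-logit tokens above look like mojibake but may simply be byte-level BPE token strings. The sketch below assumes (the page does not say so) that the tokenizer uses the GPT-2-style byte-to-character table, and maps a displayed token back to its raw UTF-8 bytes. The function names `bytes_to_unicode`, `token_to_bytes`, and `decode_token` are illustrative, not part of the dashboard.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable-character table used by GPT-2-style
    byte-level BPE (assumed here; the dashboard does not name its tokenizer)."""
    byte_values = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    codepoints = byte_values[:]
    extra = 0
    for b in range(256):
        if b not in byte_values:
            # Non-printable bytes are shifted to code points 256 and above.
            byte_values.append(b)
            codepoints.append(256 + extra)
            extra += 1
    return {b: chr(c) for b, c in zip(byte_values, codepoints)}


# Inverse table: displayed character -> original byte value.
UNICODE_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def token_to_bytes(token: str) -> bytes:
    """Map a displayed token string back to the raw bytes it stands for."""
    return bytes(UNICODE_TO_BYTE[ch] for ch in token)


def decode_token(token: str) -> str:
    """Decode the raw bytes as UTF-8; incomplete sequences become U+FFFD."""
    return token_to_bytes(token).decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ÑĤ", "ãģ£", "à¤", "à¸", "inen"]:
        print(repr(tok), token_to_bytes(tok).hex(" "), repr(decode_token(tok)))
```

Under that assumption, `ÑĤ` corresponds to the bytes `d1 82` (Cyrillic "т") and `ãģ£` to `e3 81 a3` (Japanese "っ"), while `à¤` (`e0 a4`) and `à¸` (`e0 b8`) are incomplete UTF-8 prefixes that fall in the Devanagari and Thai ranges. That reading would be consistent with the "non-standard and special characters" explanation, but it is only an interpretation of the displayed strings.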
Activations Density 0.043%