INDEX
Explanations
punctuation marks and certain high-frequency function words
Negative Logits
ãģıãĤĮ       -0.17
arel         -0.15
oga          -0.15
marvin       -0.15
urge         -0.15
von          -0.14
inen         -0.14
ãĤ¤ãĥ«       -0.14
controllers  -0.14
績           -0.14
Positive Logits
änn      0.15
пÑĢид    0.15
omik     0.14
oppins   0.14
emin     0.14
κÏģα     0.14
udden    0.14
زر       0.14
erval    0.13
è͵      0.13
Activations Density: 0.002%