Explanations
punctuation marks and formatting in the text
Negative Logits
Token      Logit
mar        -0.19
mur        -0.16
kowski     -0.15
ra         -0.15
434        -0.14
mys        -0.14
her        -0.14
ry         -0.14
ric        -0.14
illage     -0.14
Positive Logits
Token      Logit
áty        0.18
utar       0.15
Picker     0.15
eyen       0.15
iddi       0.14
ãĥ³ãĤ¬     0.14
oola       0.14
ookies     0.14
uvw        0.14
wich       0.14
Activations Density: 0.003%