INDEX
Explanations
the letters "th" at the beginning of words
NEGATIVE LOGITS
ech       -0.17
antha     -0.17
ease      -0.15
ishly     -0.15
imum      -0.15
kinson    -0.15
igans     -0.14
па        -0.14
edback    -0.14
hid       -0.14
POSITIVE LOGITS
ematic    0.22
Th        0.21
ales      0.20
ailand    0.20
th        0.19
omas      0.19
ALES      0.19
rought    0.18
ematics   0.18
rough     0.18
Activations Density: 0.034%