INDEX
Explanations
instances of the word "and."
instances of special characters or formatting in the text
Negative Logits
bub       -0.59
Leaks     -0.52
egu       -0.51
hub       -0.51
.*        -0.51
seat      -0.51
hoe       -0.50
recogn    -0.48
foul      -0.47
—-        -0.47
Positive Logits
romeda    0.99
rogens    0.99
rew       0.96
ERSON     0.94
rogen     0.87
then      0.71
rost      0.71
alus      0.69
rea       0.67
secondly  0.67
Activations Density: 0.063%