Explanations
instances of the word "don't"
Negative Logits
“      -0.93
=”     -0.87
=’     -0.80
.”     -0.80
,”     -0.80
”,     -0.76
(“     -0.75
…”     -0.74
”),    -0.73
?”     -0.72
Positive Logits
'      1.68
'      1.43
"      1.39
。"    1.37
'"     1.28
"      1.28
<bos>  1.24
'.     1.23
"'     1.20
'...   1.17
Activations Density 0.655%