INDEX
Explanations
phrases indicating a reminder or a note
phrases that signify reminders or important notes
Negative Logits
Americ    -0.68
gt        -0.68
luaj      -0.66
Hon       -0.66
anooga    -0.65
Hide      -0.64
uries     -0.64
rill      -0.64
Helpful   -0.63
Gordon    -0.63
Positive Logits
ONLY      0.98
NEVER     0.92
ALSO      0.90
BEFORE    0.82
NOT       0.82
DID       0.81
ALWAYS    0.81
DOES      0.80
MUCH      0.77
actually  0.77
Activations Density: 0.582%