INDEX
Explanations
asking questions for clarification
Negative Logits
doesn          0.47
should         0.44
appropriate    0.43
Should         0.43
can            0.43
should         0.42
whatever       0.42
needs          0.42
proper         0.41
serious        0.41
Positive Logits
yours          0.74
Yours          0.73
curioso        0.70
curious        0.68
Interess       0.61
Interestingly  0.59
Did            0.57
interess       0.57
Curious        0.56
curios         0.56
Activations Density: 0.006%