INDEX
Explanations
questions ending with a question mark
rhetorical questions
Negative Logits
celebr        -0.68
hust          -0.66
proud         -0.62
inactive      -0.62
cross         -0.60
andi          -0.59
migration     -0.59
contagious    -0.58
happy         -0.57
upstream      -0.56
Positive Logits
Answer              1.51
Well                1.05
³³³³                0.96
Yes                 0.96
Probably            0.95
Solution            0.94
YES                 0.91
Answer              0.89
³³³³³³³³³³³³³³³³    0.89
Correct             0.87
Activations Density: 0.141%