INDEX
Explanations
punctuation marks, particularly question marks and periods
Negative Logits
Token       Logit
arg         -0.17
оваÑĢ       -0.15
questions   -0.15
Nor         -0.14
thất        -0.14
argc        -0.14
olla        -0.14
iao         -0.14
sweeping    -0.13
il          -0.13
Positive Logits
Token       Logit
Ans         0.26
Ans         0.24
ANS         0.24
Answer      0.23
ans         0.22
Answer      0.21
answer      0.20
answer      0.20
_ans        0.20
ans         0.20
Activations Density: 0.044%