Explanations
questions throughout the text
Negative Logits
oire      -0.70
aure      -0.70
fread     -0.69
𝙫         -0.68
a         -0.68
navbar    -0.67
Bradley   -0.67
𝓭         -0.66
aus       -0.65
ade       -0.64
Positive Logits
%?        1.86
?         1.72
?!?       1.66
؟         1.64
’?        1.59
$?        1.58
?}        1.52
!?        1.52
?"        1.50
?         1.49
Activations Density: 0.138%