INDEX
Explanations
brackets and set notation, particularly in a mathematical context
Negative Logits
T     -0.41
k     -0.34
s     -0.32
n     -0.32
l     -0.31
x     -0.30
j     -0.28
i     -0.28
TU    -0.27
$     -0.25
Positive Logits
[       0.19
=}      0.18
'}      0.17
}\      0.16
}       0.16
}`      0.15
;}      0.15
>}      0.15
}}      0.15
}↵↵↵    0.15
Activations Density: 0.100%