Explanations
reporting findings
implies/indicates followed by consequence/explanation
Negative Logits
in      0.84
of      0.70
at      0.64
was     0.64
is      0.63
of      0.63
i       0.60
are     0.57
wenn    0.55
are     0.55
Positive Logits
N           0.50
ת           0.46
B           0.41
ין          0.41
T           0.39
Peach       0.38
Z           0.37
nonconvex   0.37
다고        0.36
핍          0.36
Activations Density 5.908%