INDEX
Explanations
instances of asking and responding to questions
Negative Logits
леж      -0.16
uve      -0.16
trời     -0.15
aphael   -0.14
orget    -0.14
viso     -0.14
_UNS     -0.14
quate    -0.13
inish    -0.13
gae      -0.13
Positive Logits
questions    0.34
if           0.34
what         0.34
whether      0.34
why          0.33
about        0.32
permission   0.30
how          0.28
point        0.27
what         0.24
Activations Density 0.050%
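
The density figure above is usually read as the fraction of token positions on which the feature fires. The sketch below shows one way such a number could be computed. It is an illustration only: the function name `activation_density`, the zero threshold, and the sample data are assumptions, not taken from the dashboard or from any particular library.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: a position counts as "active" when its value is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 1 of 2000 token positions.
acts = [0.0] * 1999 + [3.2]
density = activation_density(acts)
print(f"{density:.3%}")  # prints "0.050%"
```

Under this reading, a density of 0.050% means the feature is active on roughly one token in two thousand, which is consistent with a sparse feature that responds only to a narrow kind of text.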