INDEX
Explanations
expressions of confusion or frustration
Negative Logits
Oops     -0.16
oops     -0.16
drv      -0.16
Beard    -0.15
Crud     -0.15
fait     -0.15
åĵ       -0.14
嗯       -0.14
pron     -0.14
Hmm      -0.14
Positive Logits
why          0.33
Why          0.26
seriously    0.24
why          0.24
WHY          0.23
surely       0.22
how          0.22
Seriously    0.21
Why          0.20
为什么       0.19
Activations Density 0.310%