INDEX
Explanations
expressions of surprise or disbelief
Negative Logits
folk        -0.15
enger       -0.15
ume         -0.14
rb          -0.14
asons       -0.14
uento       -0.14
.undo       -0.14
alion       -0.14
resenter    -0.14
.dw         -0.13
Positive Logits
snap        0.24
wait        0.22
shoot       0.22
yes         0.21
boy         0.20
hk          0.20
bother      0.20
lord        0.19
g           0.19
snap        0.18
Activations Density: 0.017%