Explanations
questions and expressions of hope or concern
Negative Logits
Token    Logit
aren     -0.18
Isn      -0.17
wasn     -0.17
isn      -0.16
only     -0.16
zwar     -0.15
haven    -0.15
.look    -0.14
là       -0.14
There    -0.14
Positive Logits
Token      Logit
happens    0.23
happened   0.20
seper      0.19
we         0.18
drew       0.17
Separ      0.17
separates  0.16
got        0.16
Separ      0.16
kept       0.16
Activations Density 0.085%