INDEX
Explanations
questions or phrases expressing curiosity
Negative Logits
ollapsed      -0.15
æī¾åΰ         -0.14
hopefully     -0.14
oter          -0.13
avr           -0.13
ëIJ©ëĭĪëĭ¤     -0.13
enÃŃ          -0.13
eso           -0.13
ाहत           -0.12
agog          -0.12
POSITIVE LOGITS
wouldn        0.31
hasn          0.31
shouldn       0.31
would         0.30
should        0.30
aren          0.28
couldn        0.28
ever          0.28
didn          0.27
isn           0.26
Activations Density 0.030%