Explanations
sentences that list reasons or explanations
Negative Logits
kl         -0.17
aight      -0.16
\grid      -0.15
조         -0.14
wort       -0.14
licable    -0.14
kla        -0.14
éĭ         -0.13
alus       -0.13
_spin      -0.13
Positive Logits
Firstly    0.30
firstly    0.26
まず       0.24
first      0.23
①          0.21
First      0.20
primero    0.20
First      0.19
начала     0.19
먼저       0.19
Activations Density 0.192%