Explanations
sentences that express reasoning or justification
Negative Logits
hr      -0.17
ly      -0.15
vap     -0.15
软      -0.14
ndl     -0.14
ilk     -0.14
ги      -0.14
aleb    -0.14
Ñģен    -0.14
onas    -0.14
Positive Logits
,[],    0.17
æĺŃ     0.15
ziej    0.15
ÙĦÛĮس   0.15
igor    0.15
èijī     0.14
;line   0.14
508     0.14
WAYS    0.14
kli     0.14
Activations Density 0.293%