Explanations
phrases discussing justifications or explanations for actions or beliefs
Negative Logits
gow       -0.17
uye       -0.15
/run      -0.15
кас       -0.15
ay        -0.14
achi      -0.14
moy       -0.14
uy        -0.14
ipa       -0.14
/read     -0.13
Positive Logits
why       0.22
why       0.19
üstü      0.18
lessly    0.17
hift      0.17
nement    0.16
APPER     0.16
dolayı    0.16
afort     0.16
nal       0.16
Activations Density 0.047%