Explanations
Technical phrases followed by a comma.
Tokens that occur in the model's long explanatory responses (assistant-generated, contentful reply text).
Negative Logits
gdyż	0.30
ponieważ	0.28
takže	0.27
sodass	0.26
pretože	0.26
sehingga	0.25
kerana	0.25
لأن	0.24
waardoor	0.24
çünkü	0.23
Positive Logits
,	0.43
،	0.42
,	0.41
there	0.38
we	0.37
၊	0.35
፣	0.34
there	0.33
,*	0.32
、	0.31
Activations Density 0.184%
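For readers unfamiliar with the density figure, the sketch below shows one common way such a number is computed: the fraction of token positions on which the feature's activation is above zero. This is an assumption about the metric, not something the page states. The function name `activation_density`, the `threshold` parameter, and the sample values are all hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of positions where the feature fires.
    """
    values = list(activations)
    if not values:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in values if a > threshold)
    return active / len(values)


# Hypothetical activations for one feature over 8 token positions.
sample = [0.0, 0.0, 1.7, 0.0, 0.0, 0.0, 0.0, 0.0]
print(f"{activation_density(sample):.3%}")  # prints 12.500%
```

Under this reading, a density of 0.184% would mean the feature fires on roughly 18 of every 10,000 tokens sampled.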