INDEX
    Explanations

    technical phrases, followed by comma

    Tokens that occur in the model's long explanatory responses (assistant-generated, contentful reply text).

    Negative Logits
    " gdyż"         0.30
    " ponieważ"     0.28
    " takže"        0.27
    " sodass"       0.26
    " pretože"      0.26
    " sehingga"     0.25
    " kerana"       0.25
    " لأن"          0.24
    " waardoor"     0.24
    " çünkü"        0.23
    Positive Logits
    ","                 0.43
    "،"                 0.42
    (token missing)     0.41
    " there"            0.38
    " we"               0.37
    (token missing)     0.35
    (token missing)     0.34
    "there"             0.33
    ",*"                0.32
    (token missing)     0.31
    Act Density 0.184%

    No Known Activations