INDEX
    Explanations

    now let's continue analysis

    Negative Logits
        "Although"  0.60
        "Menurut"  0.58
        "Aunque"  0.57
        " മാത്രമല്ല"  0.56
        " زیرا"  0.55
        "雖然"  0.55
        " Aunque"  0.54
        "atschapp"  0.53
        "Jangan"  0.53
        " също"  0.52
    Positive Logits
        ","  1.34
        " we"  1.30
        " they"  1.29
        " there"  1.10
        "،"  1.09
        " it"  1.06
        " you"  0.99
        [token not recoverable]  0.97
        ",("  0.90
        ",《"  0.90
    Activation Density: 1.228%

    No Known Activations