    Explanations

    sentences that express reasoning or justification

    Negative Logits
    hr        -0.17
    ly        -0.15
    vap       -0.15
    软        -0.14
    ndl       -0.14
    ilk       -0.14
    ги        -0.14
    aleb      -0.14
    Ñģен      -0.14
    onas      -0.14
    Positive Logits
    ,[],      0.17
    æĺŃ       0.15
    ziej      0.15
    ÙĦÛĮس     0.15
    igor      0.15
    èijī       0.14
    ;line     0.14
    508       0.14
    WAYS      0.14
     kli      0.14
    Activation Density: 0.293%

    No Known Activations