INDEX
    Explanations

    language related to arguments or inconsistencies in reasoning

    Negative Logits
    ÁCT
    -0.53
    ennen
    -0.46
     Pender
    -0.44
     statt
    -0.44
    مث
    -0.41
    års
    -0.41
    orex
    -0.40
     Зак
    -0.40
     glabrous
    -0.40
     مرئيه
    -0.40
    Positive Logits
     aside
    1.86
     side
    1.66
     Side
    1.52
    aside
    1.50
    Side
    1.50
    side
    1.45
     SIDE
    1.37
    SIDE
    1.33
     Aside
    1.32
     sides
    1.31
    Activation Density 0.128%

    No Known Activations