INDEX
    Explanations

    phrases indicating exceptions, contradictions, or nuanced arguments

    Negative Logits
    Token      Logit
    "ibur"     -0.15
    "ãi"       -0.13
    "597"      -0.13
    "foy"      -0.13
    " sẵn"     -0.13
    "ذ"        -0.13
    "तम"       -0.13
    "stup"     -0.12
    "ipop"     -0.12
    "æ´ĭ"      -0.12
    Positive Logits
    Token      Logit
    " true"    0.68
    "true"     0.55
    " TRUE"    0.47
    " True"    0.46
    "True"     0.41
    "TRUE"     0.41
    ".true"    0.40
    "\ttrue"   0.40
    " untrue"  0.38
    "(true"    0.37
    Activation Density: 0.106%

    No Known Activations