INDEX
    Explanations

    asking questions or stating uncertainty

    Negative Logits
    Logit    Token
    -1.15    " on"
    -1.08    " its"
    -1.06    "?!”"
    -1.02    " will"
    -1.00    " IN"
    -0.98    (blank)
    -0.97    " D"
    -0.96    " platform"
    -0.96    "因为它"
    -0.93    " out"

    Positive Logits
    Logit    Token
    3.08     " this"
    2.23     " этого"
    2.03     " این"
    2.03     " these"
    2.02     " этой"
    1.99     " এই"
    1.99     "はこの"
    1.98     "这种"
    1.97     " этом"
    1.95     " этим"
    Act Density 0.212%

    No Known Activations