INDEX
    Explanations

    references to the prevalence and characteristics of specific concepts or phenomena

    Negative Logits
     itself
    -0.35
    的一个
    -0.22
    它
    -0.21
     its
    -0.20
    一個
    -0.18
     Its
    -0.18
    一个
    -0.18
    一个人
    -0.17
    Its
    -0.17
     которое
    -0.16
    Positive Logits
     themselves
    0.50
     ones
    0.33
    些
    0.30
     those
    0.29
     thems
    0.27
     những
    0.25
    nt
    0.24
     mga
    0.23
    those
    0.23
     few
    0.23
    Act Density 0.756%

    No Known Activations