    Explanations

    phrases indicating abandonment or neglect

    Negative Logits
    Token          Logit
    "lav"          -0.17
    "ÏĥÏĦÏĮ"       -0.16
    "wang"         -0.16
    "avra"         -0.15
    " disguise"    -0.15
    "kel"          -0.14
    "akh"          -0.14
    ",$_"          -0.14
    "GU"           -0.13
    " Keeping"     -0.13
    Positive Logits
    Token          Logit
    " behind"      0.40
    " alone"       0.37
    " Behind"      0.33
    "alone"        0.32
    "Behind"       0.30
    "beh"          0.28
    " Alone"       0.28
    "-alone"       0.25
    " aside"       0.24
    "à¹Ħว"         0.22
    Activation Density: 0.052%

    No Known Activations