INDEX
    Explanations

    words and phrases related to overt actions or manifestations

    Negative Logits
    er      -0.34
    y       -0.32
    oa      -0.30
    erse    -0.28
    oj      -0.28
    eri     -0.27
    erm     -0.26
    ime     -0.26
    ing     -0.26
    ype     -0.26
    Positive Logits
    et      0.19
    etik    0.17
    an      0.17
    บาท     0.17
    ta      0.17
    chal    0.16
    te      0.16
    g       0.16
    old     0.16
    anse    0.15
    Act Density 0.081%

    No Known Activations