INDEX
    Explanations

    phrases expressing negative sentiments or actions

    Negative Logits
    tero                -0.20
    opes                -0.16
    ail                 -0.15
    vox                 -0.15
    wort                -0.14
    imar                -0.14
    íĦ°                 -0.13
    à¸Ľà¸£à¸°à¸Īำ      -0.13
    erosis              -0.13
    .LoadScene          -0.13
    Positive Logits
    GGLE                0.17
    ób                  0.17
     toward             0.17
     towards            0.15
     exc                0.15
    aran                0.15
     Tow                0.15
    ledo                0.15
    ies                 0.15
    ļĮ                  0.15
    Activation Density: 0.016%

    No Known Activations