INDEX
    Explanations

    even followed by harmful actions

    Negative Logits
     appunto    0.45
     anzi    0.43
     it    0.43
    сцю    0.42
    ziehungs    0.41
    🄴    0.41
     thereby    0.41
     ellos    0.40
    ственно    0.40
     fashioned    0.39
    Positive Logits
    即使    0.60
     حتی    0.57
    Even    0.57
     Even    0.56
     даже    0.56
     EVEN    0.54
     even    0.52
     حتى    0.52
    就算    0.52
    even    0.51
    Act Density 0.052%

    No Known Activations