    Explanations

    Expressions related to absurdity and criticism of social norms

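    Negative Logits
        " thanks"       -0.17
        "due"           -0.15
        " благодаря"    -0.15   (Russian: "thanks to")
        " grâce"        -0.15   (French, as in "grâce à": "thanks to")
        "alian"         -0.14
        " путем"        -0.14   (Russian: "by means of")
        " Due"          -0.14
        " meaning"      -0.14
        " due"          -0.14
        "thanks"        -0.13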
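Some tokens in these lists (the Russian and Chinese entries) come from a GPT-2-style byte-level BPE vocabulary, in which every UTF-8 byte is stored as a printable character. When an export skips the final decoding step, they show up as mojibake such as `ÑĢÑı` or `èĢĥèĻĳ`. The minimal sketch below reverses that mapping. It assumes the tokenizer uses GPT-2's byte-to-character table; the dashboard does not name the model or tokenizer. `encode_token` and `decode_token` are hypothetical helpers, not part of any library.

```python
def bytes_to_unicode() -> dict[int, str]:
    """GPT-2-style table mapping each byte 0-255 to a printable character."""
    # Printable Latin-1 bytes keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    # Every remaining byte (space, control chars, 0x7F-0xA0, 0xAD) is
    # shifted to code point 256 + n, in byte order.
    n = 0
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_ENCODER = bytes_to_unicode()
BYTE_DECODER = {ch: b for b, ch in BYTE_ENCODER.items()}


def encode_token(text: str) -> str:
    """Render text the way a byte-level BPE vocabulary stores it."""
    return "".join(BYTE_ENCODER[b] for b in text.encode("utf-8"))


def decode_token(token: str) -> str:
    """Turn a byte-level BPE token string back into readable text."""
    raw = bytes(BYTE_DECODER[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print(encode_token(" путем"))   # ĠÐ¿ÑĥÑĤÐµÐ¼
print(decode_token("èĢĥèĻĳ"))  # 考虑
```

A leading space is byte 0x20, which this table maps to `Ġ`. That is why raw vocabulary entries for word-initial tokens start with that character.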
    Positive Logits
        " considering"   0.29
        " Considering"   0.20
        "Considering"    0.18
        " indeed"        0.16
        "CKER"           0.15
        " behavior"      0.14
        "år"             0.14
        "erver"          0.14
        " given"         0.14
        "考虑"           0.14   (Chinese: "consider")
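Positive and negative logit lists like these are commonly produced by projecting the feature's decoder direction through the model's unembedding matrix and keeping the largest and smallest entries. The sketch below assumes that approach and ignores the final LayerNorm. The dashboard's own values may be scaled differently. `w_dec_row`, `W_U`, and `vocab` are hypothetical stand-ins, and the toy numbers are made up, not this feature's weights.

```python
def logit_effects(w_dec_row: list[float], W_U: list[list[float]]) -> list[float]:
    """Direct effect of one feature direction on every vocabulary logit.

    w_dec_row: the feature's decoder vector, length d_model (hypothetical).
    W_U: unembedding matrix with d_model rows and d_vocab columns (hypothetical).
    """
    d_vocab = len(W_U[0])
    return [
        sum(w * W_U[i][j] for i, w in enumerate(w_dec_row))
        for j in range(d_vocab)
    ]


def top_and_bottom(effects: list[float], vocab: list[str], k: int):
    """Return the k most negative and the k most positive (token, effect) pairs."""
    order = sorted(range(len(effects)), key=lambda j: effects[j])
    negative = [(vocab[j], effects[j]) for j in order[:k]]
    positive = [(vocab[j], effects[j]) for j in reversed(order[-k:])]
    return negative, positive


# Toy example: d_model = 2, vocabulary of 4 tokens (made-up values).
vocab = [" considering", " thanks", " given", " due"]
W_U = [
    [0.29, -0.17, 0.14, -0.15],
    [0.50, 0.40, -0.30, 0.10],
]
w_dec_row = [1.0, 0.0]

negative, positive = top_and_bottom(logit_effects(w_dec_row, W_U), vocab, k=2)
print(negative)  # [(' thanks', -0.17), (' due', -0.15)]
print(positive)  # [(' considering', 0.29), (' given', 0.14)]
```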
    Activation Density: 0.282%
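Activation density is usually the share of sampled tokens on which the feature fires, meaning its activation is above zero. At 0.282%, that is roughly 2,820 tokens per million, or about one token in 355. The minimal sketch below assumes a firing threshold of zero and a plain list of per-token activations. `activation_density` is a hypothetical helper.

```python
def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# 3 firing tokens out of 1,000 sampled tokens -> 0.3%
acts = [0.0] * 997 + [1.2, 0.4, 2.7]
print(activation_density(acts))  # 0.3
```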

    No Known Activations