    Explanations

    words related to questioning and contradictions

    Negative Logits
    Token         Logit
    "andal"       -0.15
    "shouldBe"    -0.15
    " (!!"        -0.15
    "/tiny"       -0.13
    "rium"        -0.13
    " theres"     -0.13
    "/misc"       -0.13
    "?=.*"        -0.12
    "ÑģеÑĢ"      -0.12
    "ohen"        -0.12

    Positive Logits
    Token         Logit
    " not"        0.69
    " nicht"      0.67
    " tidak"      0.60
    " niet"       0.59
    " không"      0.58
    " नह"         0.57
    " не"         0.57
    " não"        0.56
    " ikke"       0.56
    " NOT"        0.54

    Activation Density: 1.354%

    No Known Activations