INDEX
    Explanations

    phrases related to specific actions or instructions

    phrases indicating negation or refusal

    Negative Logits
     Palest     -0.75
    anwhile     -0.70
     mathemat   -0.70
     RAD        -0.66
     Fatal      -0.64
     Morg       -0.63
    çīĪ         -0.63
     Leilan     -0.62
     Hir        -0.62
     Blaz       -0.60
    Positive Logits
    Ķ    1.23
    ¬    1.21
    ª    1.20
    ĸ    1.19
    £    1.19
    ©    1.14
    ¿    1.14
    ¼    1.14
    ij    1.13
    Ļ    1.12
    Activation Density 0.170%

    No Known Activations