    Explanations

    negations and negative phrases

    Negative Logits
    Token       Logit
    "ogan"      -0.07
    "speaker"   -0.07
    "bane"      -0.07
    "sts"       -0.06
    "onse"      -0.06
    "ousel"     -0.06
    " æ°"       -0.06
    "Æ¡"        -0.06
    "лем"       -0.06
    "otive"     -0.06
    Positive Logits
    Token       Logit
    " alone"    0.17
    " Alone"    0.13
    "alone"     0.13
    " sole"     0.12
    " saja"     0.11
    "-alone"    0.10
    "å͝ä¸Ģ"    0.09
    "ë¿IJ"      0.09
    " only"     0.09
    " seul"     0.09
    Activation Density: 0.012%

    No Known Activations