    Explanations

    negations and expressions of doubt or uncertainty

    Negative Logits
    Token           Logit
    ".synthetic"    -0.17
    "oley"          -0.17
    "UNUSED"        -0.16
    "aghan"         -0.15
    "ίθ"            -0.14
    "alker"         -0.14
    "ilerini"       -0.14
    ".Empty"        -0.14
    "okers"         -0.14
    "olla"          -0.14
    Positive Logits
    Token           Logit
    " bad"          0.90
    " Bad"          0.85
    "Bad"           0.79
    "bad"           0.79
    " BAD"          0.77
    "_bad"          0.66
    "BAD"           0.65
    " worst"        0.63
    " worse"        0.63
    ".bad"          0.59
    Activation Density: 0.220%

    No Known Activations