    Explanations

    references to contrasting concepts, particularly related to good and bad

    Negative Logits
    inson      -0.15
    ãĤ«ãĥ«     -0.15
    lep        -0.15
    оло        -0.14
    Lİ         -0.14
     Depot     -0.14
    218        -0.14
    atorium    -0.14
    ULONG      -0.14
    iliar      -0.14
    Positive Logits
     bad       0.49
    bad        0.45
     Bad       0.42
    Bad        0.40
    _bad       0.38
     BAD       0.34
    åĿı        0.32
    .bad       0.31
    BAD        0.29
     evil      0.28
    Activation Density: 0.081%

    No Known Activations