INDEX
    Explanations

    negative judgment and insults

    Negative Logits
    Token         Logit
    "unexpected"  0.45
    " ảo"         0.42
    " Lx"         0.42
                  0.42
    "is"          0.41
    "negative"    0.41
    "arci"        0.41
    "有问题"       0.41
    " COX"        0.41
    " ALWAYS"     0.41
    Positive Logits
    Token          Logit
    " stupid"      0.64
    " disgusting"  0.59
    " stupidity"   0.56
    "🤮"           0.56
    " plut"        0.55
    " insults"     0.55
    " vulgar"      0.54
    " lousy"       0.54
    "まとめ"        0.53
    " inept"       0.53
    Activation Density: 0.105%

    No Known Activations