INDEX
    Explanations

    harmful content refusal

    NEGATIVE LOGITS
    Token           Logit
    "encana"        1.99
    " biographer"   1.74
    "diği"          1.73
    "وبه"           1.68
    " rém"          1.64
    " suatu"        1.62
    " Scheme"       1.61
    " scheme"       1.60
    " an"           1.58
    "这篇文章"       1.52
    POSITIVE LOGITS
    Token           Logit
    "7"             4.03
    "8"             3.99
    "6"             3.80
    "5"             3.59
    "0"             3.56
    "9"             3.55
    "4"             3.40
    " distinct"     3.38
    "3"             3.30
    "rinsic"        3.27
    Activation Density: 0.369%

    No Known Activations