    Explanations

    refusing harmful content requests

    Negative Logits
    " meestal"       1.08
    " biasanya"      0.99
    "Usually"        0.97
    " Usually"       0.93
    " usually"       0.92
    " Biasanya"      0.91
    "usually"        0.90
    " selalu"        0.89
    " uguale"        0.88
    " suelen"        0.87
    Positive Logits
    " represents"    2.34
    " represent"     2.11
    "represents"     2.04
    " Represents"    1.85
    " rappresenta"   1.79
    " raises"        1.79
    " representa"    1.76
    " constitutes"   1.69
    " représente"    1.64
    " Represent"     1.63
    Act Density 0.652%

    No Known Activations