    Explanations

    the model's refusal to generate harmful, unethical, or inappropriate content.

    Negative Logits
    Token             Logit
    " gives"          0.58
    " provides"       0.58
    " denotes"        0.57
    " merupakan"      0.55
    " adalah"         0.54
                      0.53
    " Provides"       0.53
    " constitutes"    0.52
    " generates"      0.52
    " requires"       0.52
    Positive Logits
    Token             Logit
    "However"         1.05
    "But"             0.96
    "Therefore"       0.95
    "Consequently"    0.84
    "但是"            0.84
    "BUT"             0.83
    "therefore"       0.81
    "Furthermore"     0.80
    "Moreover"        0.79
    " However"        0.78
    Activation Density: 1.199%

    No Known Activations