    Explanations

    explaining refusals and disclaimers

    Negative Logits
    0.97
     ];
    0.93
     /
    0.90
     );
    0.88
     \}
    0.88
     】,
    0.87
     ');
    0.86
     \]
    0.86
    0.83
     \[
    0.82
    Positive Logits
    Token               Logit
    " Interestingly"    0.95
    "<unused940>"       0.92
    "<unused1658>"      0.87
    "Interestingly"     0.86
    "As"                0.83
    " Fortunately"      0.82
    " Unlike"           0.81
    " Thankfully"       0.80
    "After"             0.78
    " As"               0.78
    Activation Density: 0.126%

    No Known Activations