    Explanations

    AI assistant refusing requests

    Negative Logits
    abhavam      0.40
    enchymal     0.39
    ské          0.39
    Neurons      0.39
    льт          0.38
    jenis        0.38
    Breakpoint   0.38
    𝘭            0.38
    toggleClass  0.38
    0.38
    Positive Logits
     assistant   0.50
     avoid       0.49
     Avoiding    0.46
     assistants  0.45
     Avoid       0.42
    避免         0.42
     assist      0.41
     gaf         0.41
     lounge      0.41
     avoids      0.41
    Act Density 0.006%

    No Known Activations