INDEX
    Explanations

    Sentences in which the assistant refers to itself and issues safety or refusal disclaimers (e.g., "I am programmed..." / "As such, I cannot...").

    NEGATIVE LOGITS
    vx                0.40
    \|_{              0.40
    Synchronization   0.39
    Foreign           0.39
    trained           0.39
    orr               0.38
     untrained        0.38
    Synchron          0.38
    Glasgow           0.38
     რაც              0.38
    POSITIVE LOGITS
     obviously        0.45
     sadly            0.43
     robots           0.42
     компью           0.41
     lines            0.41
     wrists           0.40
     actually         0.40
     computer         0.40
     funny            0.39
     rêves            0.39
    Activation Density 0.027%

    No Known Activations