INDEX
    Explanations

    punctuation marks at the end of sentences

    Negative Logits
     deceptive        -0.61
     confidentiality  -0.57
     whistleblowers   -0.57
     privacy          -0.57
     outgoing         -0.56
     safety           -0.56
     advis            -0.55
     confidential     -0.54
     trusted          -0.54
     stewards         -0.53
    Positive Logits
    imgur      1.01
    e          0.91
    aca        0.75
    hat        0.75
    seed       0.74
    MX         0.71
    hs         0.71
    minimum    0.71
    ¼          0.70
    medium     0.70
    Activation Density: 0.021%

    No Known Activations