INDEX
    Explanations

    sentences that critique societal norms and behaviors, particularly regarding humor and double standards in media

    Negative Logits
    .�
    -0.77
    ----------
    -0.76
    .''
    -0.72
    ,''
    -0.72
    Copyright
    -0.70
    tion
    -0.67
    .}
    -0.67
    `.
    -0.66
    properties
    -0.65
     |
    -0.64
    Positive Logits
     haunted
    0.72
     inexpl
    0.70
     cannibal
    0.69
     coughing
    0.69
     scor
    0.68
     endlessly
    0.68
     sleek
    0.68
     humming
    0.68
     glowing
    0.68
     ooz
    0.67
    Act Density 0.433%

    No Known Activations