INDEX
    Explanations

    phrases indicating challenging authority or societal norms

    statements about taking risks or challenging societal norms

    Negative Logits
    Token            Logit
    "usterity"       -0.78
    "CG"             -0.67
    "urgy"           -0.67
    "automatic"      -0.65
    "UST"            -0.63
    "ulative"        -0.62
    " Powered"       -0.62
    " stabilized"    -0.62
    "ulators"        -0.60
    " gearing"       -0.60
    Positive Logits
    Token            Logit
    "evil"           1.04
    " defy"          0.95
    " presume"       0.83
    " dare"          0.80
    " disagree"      0.79
    "ously"          0.75
    " argue"         0.73
    " disob"         0.73
    " undertake"     0.73
    "word"           0.73
    Act Density 0.034%

    No Known Activations