INDEX
    Explanations

    Phrases expressing responsibility or critique directed at particular individuals

    Negative Logits
    Token             Logit
    "urations"        -0.60
    " Esp"            -0.60
    "Bron"            -0.58
    "orth"            -0.58
    "ibaba"           -0.58
    " Membership"     -0.57
    " Shutterstock"   -0.57
    " Trinity"        -0.56
    "tesque"          -0.56
    "Louis"           -0.55
    Positive Logits
    Token             Logit
    " responsible"    0.89
    "liest"           0.84
    " deciding"       0.82
    " happiest"       0.79
    " who"            0.73
    "abet"            0.72
    "responsible"     0.72
    " reacting"       0.71
    " initiating"     0.69
    " risking"        0.69
    Activation Density: 0.098%

    No Known Activations