INDEX
    Explanations

    expressions of strong emotions and opinions regarding justice and fairness

    Negative Logits
    .*↵↵      -0.18
    .*↵       -0.18
    .↵↵       -0.15
    .*,       -0.15
    ).*       -0.15
    .)↵↵      -0.15
    !*        -0.15
    .*/↵      -0.14
    ."""↵↵    -0.14
    ).↵↵      -0.14
    Positive Logits
     its      0.28
     cant     0.23
     hope     0.22
     ive      0.22
     ,,       0.22
     dont     0.21
     iam      0.21
    .look     0.21
    .im       0.20
    .i        0.20
    Act Density 1.024%

    No Known Activations