INDEX
    Explanations

    expressions of compassion or altruism

    Negative Logits
    ,’”     -0.31
    ,”      -0.30
    ,’      -0.28
            -0.26
     “[     -0.25
    ,’’     -0.25
    =”      -0.25
    .”      -0.24
            -0.24
    ,“      -0.23
    Positive Logits
     "      0.58
     '      0.52
    's      0.50
    'll     0.48
    've     0.47
    're     0.46
    'm      0.44
    'd      0.43
     ("     0.42
    't      0.40
    Activation Density: 3.023%

    No Known Activations