INDEX
    Explanations

    phrases indicating joking or exaggeration

    expressions indicating humor or sarcasm

    Negative Logits
     CTR            -0.64
    roots           -0.59
     Bridges        -0.58
    iliated         -0.58
     manned         -0.56
    reens           -0.56
    icipated        -0.56
     intersect      -0.56
     scrim          -0.56
     competed       -0.55
    Positive Logits
    ^^^^            0.86
    _.              0.85
     kidding        0.82
     haha           0.79
     ;)             0.79
     :-)            0.76
     myself         0.75
     :)             0.74
    idge            0.73
     here           0.72
    Activation Density: 0.403%

    No Known Activations