INDEX
    Explanations

    adjectives describing emotions or judgments

    discussions about controversial or unsettling topics

    Negative Logits
    umbn  -0.82
    rongh  -0.78
    foreseen  -0.75
    execute  -0.74
    ocument  -0.71
    cele  -0.71
    sylvania  -0.71
    mediately  -0.70
    obook  -0.69
    ufact  -0.68
    Positive Logits
     huh  1.49
     eh  1.41
     tho  1.04
    ?!  1.01
    !  1.00
    !!  0.95
    !?  0.95
     ya  0.92
     ðŁĺ  0.85
     kidding  0.84
    Act Density 0.525%

    No Known Activations