INDEX
    Explanations

    descriptions of things as "nice"

    the repeated use of the word "nice."

    Negative Logits
    "arians"        -0.79
    "ogens"         -0.79
    "authorized"    -0.79
    "arers"         -0.76
    "arian"         -0.75
    "ochond"        -0.75
    "inant"         -0.74
    "rained"        -0.74
    "uilding"       -0.71
    "igate"         -0.69
    Positive Logits
    " nice"         0.92
    " bye"          0.86
    " fluffy"       0.81
    "bye"           0.80
    " touches"      0.78
    " additions"    0.78
    " little"       0.75
    " enough"       0.75
    " neat"         0.74
    " bonus"        0.74
    Act Density 0.019%

    No Known Activations