    Explanations

    phrases expressing personal feelings or opinions

    expressions of personal feelings or comparisons

    Negative Logits
    Token        Logit
    "ircraft"    -0.86
    "hiba"       -0.83
    "abases"     -0.83
    "alt"        -0.81
    "ouched"     -0.76
    "alez"       -0.75
    "ells"       -0.71
    "arry"       -0.70
    "ourse"      -0.70
    "arling"     -0.70

    Positive Logits
    Token          Logit
    "lier"         0.86
    "liest"        0.73
    " parity"      0.73
    " calling"     0.71
    " crap"        0.70
    " picking"     0.67
    " slipping"    0.66
    " spitting"    0.66
    "lihood"       0.65
    "¥µ"           0.64

    Act Density 0.025%

    No Known Activations
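
The activation density figure above is usually read as the fraction of sampled token positions on which the feature fires. The sketch below illustrates that reading only; the function name `activation_density`, the zero threshold, and the sample data are assumptions for illustration and are not taken from this page or from any particular tool.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: a feature "fires" on a token when its activation is strictly
    greater than `threshold` (0.0 here). Other tools may define this differently.
    """
    if not activations:
        return 0.0
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: the feature fires on 1 of 4,000 token positions.
sample = [0.0] * 3999 + [1.7]
density = activation_density(sample)

# Format as a percentage with three decimal places.
print(f"{density:.3%}")
```

With one firing position in 4,000, the density is 0.00025, which formats as 0.025%. A density this low means the feature is very sparse, which is consistent with a dashboard that has few or no recorded activations to display.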