INDEX
    Explanations

    adjectives describing intensity or severity

    terms describing mildness or pleasantness

    Negative Logits
    Token                Logit
    "aucus"              -0.67
    " Accountability"    -0.66
    "hedral"             -0.66
    "rencies"            -0.65
    "lining"             -0.64
    "aturated"           -0.63
    "ilings"             -0.62
    "funding"            -0.60
    "etus"               -0.60
    " Emir"              -0.60
    Positive Logits
    Token                Logit
    "hello"              0.79
    " harmless"          0.78
    " annoyance"         0.77
    " surprise"          0.76
    " surprises"         0.75
    " nuisance"          0.75
    "»Ĵ"                 0.75
    " surpr"             0.73
    "ew"                 0.71
    " prank"             0.70
    Activation Density: 0.091%

    No Known Activations