INDEX
    Explanations

    sentiments and expressions of surprise or disbelief

    Negative Logits
    erif      -0.17
    jav       -0.14
    .sig      -0.14
    arella    -0.14
    riott     -0.14
     Surre    -0.14
     kum      -0.14
    ãĢħ       -0.14
    ersist    -0.14
    contr     -0.14
    Positive Logits
    386       0.18
    673       0.15
    zee       0.14
    ãĥªãĥ³    0.14
    pek       0.14
    Ads       0.14
    nc        0.14
    isha      0.14
    ames      0.13
    лаз       0.13
    Activation Density: 0.061%

    No Known Activations