INDEX
    Explanations
    Negative Logits
    Token         Logit
    "s"           -0.91
    "i"           -0.80
    " hates"      -0.78
    " surla"      -0.77
    " hate"       -0.76
    "ship"        -0.76
    " NSCoder"    -0.76
    "ه"           -0.75
    " Hate"       -0.74
    "e"           -0.74
    Positive Logits
    Token         Logit
    "bige"        0.42
    "RunAsync"    0.40
    "bewerken"    0.37
    "wikimedia"   0.37
    "wildcard"    0.36
    "cuma"        0.36
    "Thunk"       0.35
    "abras"       0.35
    "ucca"        0.35
    "Cuz"         0.35
    Activation Density: 0.083%

    No Known Activations