INDEX
    Explanations
    Negative Logits
    Token         Logit
    " gravy"      -0.88
    "ostar"       -0.74
    " obviously"  -0.74
    " what"       -0.73
    " shook"      -0.73
    " ограни"     -0.72
    " fantasia"   -0.71
    " described"  -0.70
    " principi"   -0.69
    " Alamofire"  -0.68
    Positive Logits
    Token         Logit
    " label"      0.90
    " labels"     0.86
    " jargon"     0.84
    " cubrir"     0.82
    "LBL"         0.82
    " phrase"     0.81
    " term"       0.79
    " samlet"     0.77
    "PLEASE"      0.77
    " lbl"        0.77
    Act Density 0.018%

    No Known Activations