INDEX
    Explanations

    references to specific data sources or frameworks

    Negative Logits
    Token          Logit
    " pleaſure"    -0.68
    " purpoſe"     -0.67
    " ſever"       -0.65
    " faſt"        -0.65
    " myſelf"      -0.65
    " juſ"         -0.62
    " tranſ"       -0.59
    " inſ"         -0.59
    " itſelf"      -0.58
    " viſ"         -0.58
    Positive Logits
    Token       Logit
    " dari"     1.65
    "Dari"      1.18
    "dari"      1.15
    " Dari"     1.14
    " FROM"     1.06
    " from"     1.06
    "จาก"       1.05
    " från"     0.97
    " From"     0.97
    "from"      0.97
    Activation Density: 0.001%

    No Known Activations