    Explanations

    news articles

    Negative Logits
    Token          Logit
    " purpoſe"     -0.71
    " pleaſure"    -0.68
    "ſelves"       -0.67
    " cauſe"       -0.65
    "İY"           -0.63
    " ſta"         -0.61
    " itſelf"      -0.60
    " fubject"     -0.60
    " gani"        -0.60
    " ſtate"       -0.60
    Positive Logits
    Token          Logit
    " like"        1.50
    " such"        1.17
    " вроде"       0.91
    "such"         0.89
    " seperti"     0.84
    " như"         0.83
    " مثل"         0.79
    " zoals"       0.78
                   0.77
    "like"         0.77
    Activation Density: 0.001%

    No Known Activations