INDEX
    Explanations
    Negative Logits
    "و"  0.95
    "o"  0.84
    (whitespace)  0.75
    "h"  0.73
    (token missing)  0.72
    "ный"  0.71
    "al"  0.70
    "ل"  0.67
    " 불구하고"  0.67
    "वळ"  0.66
    POSITIVE LOGITS
    "0"  0.95
    "ຖືກ"  0.81
    "ths"  0.79
    " fêtes"  0.78
    " addicts"  0.77
    "truths"  0.77
    " piccola"  0.77
    " Jokes"  0.77
    "দাতা"  0.76
    " Props"  0.75
    Act Density 0.005%

    No Known Activations