INDEX
    Explanations
    NEGATIVE LOGITS
    ia      0.89
    a       0.84
    iam     0.83
    an      0.82
    sp      0.79
    del     0.74
    ied     0.73
    ii      0.73
     $      0.72
     an     0.72
    POSITIVE LOGITS
     pace       1.02
                0.93
    يق          0.86
    verhalten   0.85
    人气        0.84
    ع           0.84
     demean     0.80
                0.79
    ك           0.79
                0.77
    Activation Density: 0.004%

    No Known Activations