    Explanations

    states of being or roles

    Negative Logits
    ת  0.80
    ל  0.75
    ו  0.74
    ה  0.71
    ع  0.69
    ע  0.67
    W  0.58
    علم  0.55
    علي  0.53
    د  0.52
    Positive Logits
     are  0.52
    ö  0.49
    ua  0.48
    uksen  0.47
    রামর্শ  0.46
    िट  0.45
     Knitting  0.44
    ą  0.44
     skupiny  0.43
     skiing  0.42
    Act Density 1.325%

    No Known Activations