INDEX
    Explanations
    Negative Logits
    Token               Logit
     ModelExpression    -0.48
    IDTH                -0.46
     تضيفلها            -0.44
    homonymie           -0.43
     (__                -0.42
    angsaan             -0.42
     ویکی‌پدیا          -0.42
    eably               -0.41
    bmatrix             -0.41
     (                  -0.41
    Positive Logits
    Token               Logit
    !),                 0.77
    ),”                 0.76
    !');                0.75
    !')                 0.73
    !)                  0.73
    !).                 0.73
    ),"                 0.71
    !';                 0.71
    ).-                 0.71
    !',                 0.71
    Activation Density: 0.000%

    No Known Activations