INDEX
    Explanations

    preserving knowledge, texture, or flavor

    Negative Logits
    Token      Value
    "a"        1.53
    "l"        1.18
    "c"        1.13
    "an"       1.09
    "t"        1.02
    "0"        0.92
    (blank)    0.90
    " a"       0.90
    (blank)    0.90
    " the"     0.89
    Positive Logits
    Token      Value
    "ו"        1.48
    "و"        1.38
    "ן"        1.23
    "ומי"      1.12
    "יא"       1.10
    "I"        1.10
    "ל"        1.10
    "תה"       0.96
    "ווע"      0.93
    "ри"       0.93
    Act Density 0.005%

    No Known Activations