    Negative Logits
    a                  0.79
    (no token text)    0.73
    (no token text)    0.67
    もちろん           0.63
    ה                  0.63
    ه                  0.62
    withstanding       0.61
    v                  0.61
    يل                 0.61
    ǜ                  0.61
    Positive Logits
    م                  0.78
    m                  0.74
    (no token text)    0.74
    да                 0.74
    )                  0.68
    0                  0.66
    (no token text)    0.65
    м                  0.64
    ן                  0.64
    ம்                 0.57
    Activation Density: 0.020%

    No Known Activations