    Explanations

    open weights model explains copy

    Negative Logits
    "nouns"       0.40
    "一个个"      0.40
    "created"     0.39
    "kommer"      0.39
    " created"    0.37
    " சேர்த்த"    0.37
    "name"        0.36
    "lear"        0.36
    "NAME"        0.36
    "Fala"        0.36

    Positive Logits
    " copies"     2.11
    " Copies"     1.81
    "Copies"      1.75
    "copies"      1.71
    " copy"       1.65
    " копия"      1.52
    " копи"       1.44
    " copie"      1.41
    "コピー"      1.31
    "副本"        1.30

    Act Density 0.020%

    No Known Activations