INDEX
    Explanations

    phrases indicating substitution or replacement of concepts or items

    NEGATIVE LOGITS
    azon
    -0.17
    vla
    -0.16
    ENCHMARK
    -0.15
     Rew
    -0.14
    agle
    -0.14
    нь
    -0.14
    agrid
    -0.14
    ovu
    -0.14
    uren
    -0.14
     Bever
    -0.14
    POSITIVE LOGITS
     substitute
    0.27
    replace
    0.27
     replace
    0.26
     replacing
    0.24
     replaces
    0.23
     substit
    0.22
     replacement
    0.22
     Replace
    0.21
     substitution
    0.21
     replaced
    0.20
    Activation Density: 0.097%

    No Known Activations