    Explanations

    terms related to effectiveness and impact

    Negative Logits
    Token             Logit
    "ffects"          -0.21
    " Effects"        -0.19
    " effects"        -0.18
    "affected"        -0.18
    "erable"          -0.18
    " effectively"    -0.18
    "Effects"         -0.18
    "_effects"        -0.17
    "ffect"           -0.17
    " affected"       -0.17

    Positive Logits
    Token             Logit
    "iveness"         0.36
    "ual"             0.31
    "ively"           0.29
    "uate"            0.28
    "ives"            0.28
    "ors"             0.28
    "uated"           0.27
    "ivity"           0.26
    "uating"          0.26
    "ivement"         0.22

    Activation Density: 0.057%

    No Known Activations