    Explanations

    negations and contrasts in value-based arguments

    Negative Logits
    avery           -0.17
    453             -0.16
    ibold           -0.15
    435             -0.15
    uben            -0.15
     pau            -0.14
    StateManager    -0.14
    ree             -0.14
    anas            -0.14
    ffc             -0.14
    Positive Logits
     merely         0.19
    只是            0.17
    isol            0.15
     solely         0.15
     chased         0.15
     simply         0.15
    만              0.15
     cookie         0.15
     mere           0.14
     juste          0.14
    Act Density 0.151%

    No Known Activations