INDEX
    Explanations

    expressions of negation or denial

    Negative Logits
    Token        Logit
    "ward"       -0.19
    "ulle"       -0.17
    "ged"        -0.16
    "stu"        -0.16
    "ried"       -0.14
    "_UNUSED"    -0.14
    "named"      -0.14
    "cessive"    -0.14
    "airo"       -0.14
    "?q"         -0.14
    Positive Logits
    Token             Logit
    "oint"            0.18
    " longer"         0.17
    " matter"         0.17
    " doubt"          0.15
    " differently"    0.15
    "xious"           0.15
    "obs"             0.15
    " sooner"         0.14
    "theless"         0.14
    "ScreenWidth"     0.14
    Activation Density: 0.036%

    No Known Activations