INDEX
    Explanations
    No Explanations Found
    Negative Logits
     kindly         -0.69
     DEBUG          -0.67
    ourage          -0.66
     brave          -0.66
    handed          -0.66
     ear            -0.65
    gotten          -0.65
     gaps           -0.64
     ACL            -0.64
     loopholes      -0.64
    Positive Logits
    psc             0.77
    ̶               0.73
    tumblr          0.73
    OUP             0.68
    JD              0.67
     Mond           0.66
    UA              0.66
    oa              0.66
    ITE             0.65
    lez             0.64
    Act Density 0.000%

    No Known Activations

    This feature has no known activations.