INDEX
    Explanations
    No Explanations Found
    Negative Logits
    Logit    Token
    -0.77
    -0.76
    -0.76    " no"
    -0.75    (whitespace)
    -0.74    " a"
    -0.71    "↵↵"
    -0.70    " not"
    -0.70    " as"
    -0.70    " o"
    -0.68    " an"
    Positive Logits
    Logit    Token
    8.17     "<bos>"
    1.89     " 🤣🤣"
    1.79     " exé"
    1.75     " ftu"
    1.67     " »>"
    1.67     "<?"
    1.66
    1.66     " whofe"
    1.66     " ftre"
    1.66     " fince"
    Activation Density: 0.000%

    No Known Activations

    This feature has no known activations.