INDEX
    Explanations
    No Explanations Found
    Negative Logits
    " -"            -0.29
    "--"            -0.27
    " --"           -0.22
    "---"           -0.19
    "...↵↵"         -0.19
    "..."           -0.19
    "...↵"          -0.16
    " :"            -0.16
    "ighbor"        -0.16
    " ..."          -0.16
    Positive Logits
    " fucking"      0.22
    " fuck"         0.22
    " fucks"        0.22
    " fucked"       0.21
    "Fuck"          0.20
    " Fuck"         0.18
    "fuck"          0.18
    " FUCK"         0.18
    "ิà¸ļ"           0.17
    " Streaming"    0.17
    Act Density 0.000%

    No Known Activations

    This feature has no known activations.