INDEX
    Explanations

    ambiguous or irrelevant content, as there is no consistent theme or pattern in the activations

    specific phrases or references to entities associated with style or behavior in various contexts

    Negative Logits
     <@       -0.70
     gib      -0.68
     Gmail    -0.66
     decomp   -0.66
     "+       -0.65
     JPEG     -0.65
     +++      -0.64
     scrut    -0.61
     fortun   -0.61
     "<       -0.61
    Positive Logits
    s        1.53
    ski      1.10
    scl      1.05
    ses      1.02
    ship     1.02
    t        1.02
    d        0.98
    ved      0.97
    tis      0.97
    tal      0.95
    Act Density 0.338%

    No Known Activations