INDEX
    Negative Logits
    ;</               -0.06
    (attribute        -0.06
    mino              -0.06
    ları              -0.06
    _);↵              -0.06
     pornografia      -0.06
    aston             -0.06
     ();↵             -0.06
     fear             -0.06
    рами              -0.06
    Positive Logits
    .poster           0.08
     trigger          0.07
                      0.07
                      0.06
    <Component        0.06
    ===============   0.06
     cigarette        0.06
     heir             0.06
    Called            0.06
    Allowed           0.06
    Activation Density: 0.000%

    No Known Activations