INDEX
    Explanations

    elements related to identity or personal attributes

    Follows dialogue or a question

    Well, Okay, Yes, judge, Prior

    Negative Logits
    Token         Logit
    ' itſelf'     -1.09
    '()?;'        -0.99
    '.";\n'       -0.98
    ' فريبيس'     -0.96
    '=?";'        -0.94
    ')";\n'       -0.94
    ' ſind'       -0.93
    '✨:'          -0.92
    '".\n'        -0.91
    '%";'         -0.90
    Positive Logits
    Token         Logit
    ' I'          0.84
    ' you'        0.66
    '!'           0.64
    '.'           0.63
    ' ['          0.62
    'I'           0.61
    ','           0.60
    ' he'         0.59
    ' ('          0.58
    ' because'    0.53
    Activation Density: 0.155%

    No Known Activations