    NEGATIVE LOGITS
    </strong>
    -4.00
    ↵↵
    -3.06
          
    -2.81
        
    -2.77
    at
    -2.73
    ii
    -2.72
     craz
    -2.69
    u
    -2.59
      
    -2.55
     genannten
    -2.53
    POSITIVE LOGITS
    Although
    2.73
    </tfoot>
    2.72
    Even
    2.64
    2.53
    Surprisingly
    2.53
    Nearly
    2.52
    Despite
    2.48
    While
    2.47
     {↵↵
    2.45
    Usually
    2.44
    Act Density 0.009%

    No Known Activations