INDEX
    Explanations

    specific movie titles and references to popular media

    Negative Logits
    Token              Logit
    <bos>              -3.37
                       -0.94
    הת                 -0.74
    הח                 -0.74
    SequentialGroup    -0.71
    הע                 -0.70
    Identyfik          -0.70
     nawr              -0.69
    <?                 -0.69
    #![                -0.69
    Positive Logits
    Token              Logit
     maneu             2.28
     impra             2.07
     increa            1.98
     accla             1.97
     disagre           1.95
     emphat            1.93
     shenan            1.92
     depic             1.86
     reluct            1.86
     inev              1.86
    Activation Density: 0.270%

    No Known Activations