INDEX
    Negative Logits
    " keep"              -0.07
    "בטיח"               -0.07
    (token not shown)    -0.07
    (token not shown)    -0.06
    "\tIn"               -0.06
    " might"             -0.06
    "ETING"              -0.06
    " pool"              -0.06
    (token not shown)    -0.06
    " Gold"              -0.06
    Positive Logits
    " unjust"            0.08
    "Impossible"         0.08
    " airborne"          0.08
    "_scores"            0.08
    " crumbling"         0.08
    " Nazi"              0.08
    " fraudulent"        0.08
    " propositions"      0.07
    " cosas"             0.07
    " mostr"             0.07
    Activation Density: 0.003%

    No Known Activations