INDEX
    NEGATIVE LOGITS
    Token            Logit
    "کر"             -1.01
    (missing)        -0.92
    " beware"        -0.88
    (missing)        -0.87
    "一年"           -0.87
    "🥾"             -0.85
    " unwilling"     -0.85
    "abbr"           -0.84
    "bors"           -0.84
    "зор"            -0.84
    POSITIVE LOGITS
    Token            Logit
    " everything"    1.88
    " relax"         1.79
    "Everything"     1.63
    " reassured"     1.57
    " Relax"         1.57
    "everything"     1.52
    " Everything"    1.45
    " it"            1.45
    "relax"          1.45
    "Relax"          1.43
    Activation Density: 0.027%

    No Known Activations