INDEX
    Explanations
    NEGATIVE LOGITS
    Token              Value
    " literalmente"    0.88
    " meaning"         0.79
    "literally"        0.78
    " literally"       0.74
    " figuratively"    0.73
    " slang"           0.72
    " shenanigans"     0.70
    " histories"       0.69
    "льга"             0.69
    " craziness"       0.68
    POSITIVE LOGITS
    Token              Value
    " wynosi"          1.08
    " varies"          1.04
    " beträgt"         1.03
    " vary"            1.03
    " exceeds"         0.97
    " totalled"        0.92
    " составляет"      0.87
    " varía"           0.86
    " outweigh"        0.84
    "约为"             0.82
    Activation Density: 0.390%

    No Known Activations