INDEX
    Explanations

    according to definitions

    Negative Logits
    Token              Logit
    " historically"    0.48
    " sorely"          0.44
    " sucede"          0.42
    " invariably"      0.40
    (missing)          0.39
    " philosoph"       0.39
    " завжди"          0.39
    " philosophical"   0.39
    " задума"          0.39
    "Historically"     0.38
    Positive Logits
    Token              Logit
    " according"       0.66
    " Según"           0.66
    " According"       0.65
    " volgens"         0.64
    "According"        0.62
    " Menurut"         0.62
    "according"        0.59
    " podľa"           0.59
    "Menurut"          0.58
    " 따라서"           0.58
    Act Density 0.025%
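
    For scale, 0.025% corresponds to roughly 1 token in 4,000. The sketch below shows one way such a figure could be computed. It assumes that activation density means the percentage of sampled tokens on which the feature's activation is above zero; the page itself does not define the term. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    feature fires at all. This definition is not stated in the source.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: the feature fires on 1 of 4,000 sampled tokens.
sample = [0.0] * 3999 + [1.7]
density = activation_density(sample)
print(f"Act Density {density:.3f}%")
```

    A density this low would mean the feature is sparse, firing only on a narrow set of contexts.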

    No Known Activations