INDEX
    Explanations

    identifying official context method deviations

    Negative Logits
    "but"            0.36
    " B"             0.35
    " Rostov"        0.34
    "b"              0.34
    " Bari"          0.33
    "Roth"           0.33
    " Wirtschaft"    0.33
    " Marian"        0.32
    " sitcom"        0.32
    " Alsace"        0.32
    Positive Logits
    (token not shown)    0.35
    " inducement"        0.34
    (token not shown)    0.34
    (token not shown)    0.33
    " embarrassment"     0.32
    " milligrams"        0.31
    " annoyance"         0.31
    " そんな"            0.31
    " hydration"         0.31
    "ргә"                0.31
    Act Density 1.134%

    No Known Activations