INDEX
    Explanations

    Numerical ranges and approximations

    Negative Logits
    Token             Logit
    " presumably"     -0.10
    " incompetent"    -0.08
    " irrelevant"     -0.08
    " murderous"      -0.08
    "cov"             -0.08
    " trivial"        -0.08
    "!!!↵"            -0.08
    " violation"      -0.07
    " complied"       -0.07
    "Lucky"           -0.07
    Positive Logits
    Token               Logit
    " takeaway"         0.10
    "%左右"             0.10
    " среднего"         0.09
    " süd"              0.09
    " рекомендуется"    0.09
    "راوح"              0.09
    " moderately"       0.09
    " вари"             0.09
    " примерно"         0.09
    " 정도"             0.09
    Activation Density: 0.064%

    No Known Activations