    Explanations

    references to deceit or dishonesty

    Negative Logits
    Token            Logit
    " оригіналу"     -0.74
    " faſt"          -0.70
    " eſt"           -0.68
    " Jefus"         -0.67
    " houſe"         -0.67
    " uſe"           -0.67
    " updates"       -0.67
    " pleaf"         -0.65
    " uſ"            -0.63
    " ſp"            -0.63
    Positive Logits
    Token            Logit
    " lie"           1.35
    " lies"          1.15
    " LIE"           0.85
    " liegen"        0.83
    "Lie"            0.83
    " lied"          0.81
    " lying"         0.80
    "windowFixed"    0.80
    " laid"          0.80
    " Lie"           0.79
    Activation Density: 0.104%

    No Known Activations