INDEX
    Explanations

    phrases indicating negation or disbelief

    NEGATIVE LOGITS
    ","                  -0.15
    "unar"               -0.15
    "лаж"                -0.15
    "reon"               -0.15
    "ledge"              -0.14
    " ControllerBase"    -0.14
    "oron"               -0.14
    "lied"               -0.13
    "à¹Ģลย"              -0.13
    "afort"              -0.13
    POSITIVE LOGITS
    " gusta"             0.24
    " gust"              0.23
    " hub"               0.21
    " ha"                0.20
    " falta"             0.19
    "jos"                0.19
    " han"               0.19
    " import"            0.19
    " pid"               0.18
    " puedo"             0.18
    Act Density 0.027%

    No Known Activations