INDEX
    Explanations

    phrases related to moral warnings or consequences

    Negative Logits
    "こんな"    -0.15
    " này"      -0.14
    "这个"      -0.13
    "loh"       -0.13
    " estas"    -0.13
    " esta"     -0.13
    "aji"       -0.12
    "δέ"        -0.12
    "oji"       -0.12
    "rary"      -0.12
    Positive Logits
    " those"    1.27
    "those"     1.12
    " Those"    1.03
    "Those"     0.98
    "那些"      0.90
    " ones"     0.80
    " ceux"     0.80
    " těch"     0.62
    " celui"    0.54
    "ones"      0.52
    Act Density 0.578%

    No Known Activations