    Explanations

    References to innocence and the protection of innocent individuals.

    Negative Logits
    "Mathias"        -0.72
    "ilangkan"       -0.65
    " وتسجيلات"      -0.61
    "shifted"        -0.60
    "agot"           -0.60
    " visst"         -0.59
    " Pelham"        -0.59
    "impresa"        -0.57
    "arım"           -0.57
    " Suivez"        -0.57
    Positive Logits
    " innocent"      1.93
    "Innoc"          1.77
    " Innocent"      1.77
    "innocent"       1.70
    " innocence"     1.64
    "innoc"          1.58
    " innoc"         1.52
    " Innocence"     1.48
    " inocente"      1.35
    " Innoc"         1.33
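    A minimal sketch of how logit lists like the two above are commonly produced for sparse-autoencoder features, assuming (this page does not say so) that each score is the dot product of the feature's decoder direction with one column of the model's unembedding matrix, and that the ten highest and ten lowest scores are kept. The function names, the toy matrices and the tok_* vocabulary are hypothetical, for illustration only.

```python
def logit_effects(decoder_dir, unembed):
    """Score every vocabulary token by dot(decoder_dir, unembed[:, token]).

    decoder_dir: d_model floats, the feature's (assumed) decoder direction.
    unembed: d_model rows of vocab_size floats, i.e. W_U of shape (d_model, vocab).
    """
    vocab_size = len(unembed[0])
    return [
        sum(decoder_dir[i] * unembed[i][t] for i in range(len(decoder_dir)))
        for t in range(vocab_size)
    ]


def top_and_bottom(scores, vocab, k):
    """Return the k highest-scoring and k lowest-scoring (token, score) pairs."""
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    positive = ranked[:k]  # largest scores first
    negative = ranked[::-1][:k]  # most negative scores first
    return positive, negative


# Toy example: d_model = 2 and a hypothetical four-token vocabulary.
vocab = ["tok_a", "tok_b", "tok_c", "tok_d"]
decoder_dir = [1.0, 0.5]
unembed = [
    [2.0, -1.0, 0.0, 0.5],
    [0.0, 1.0, -2.0, 1.0],
]
scores = logit_effects(decoder_dir, unembed)  # [2.0, -0.5, -1.0, 1.0]
positive, negative = top_and_bottom(scores, vocab, k=2)
print(positive)  # [('tok_a', 2.0), ('tok_d', 1.0)]
print(negative)  # [('tok_c', -1.0), ('tok_b', -0.5)]
```

    Under this assumption, a real dashboard would run the same projection over the full vocabulary with k=10, which matches the ten entries shown in each list.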
    Activation Density: 0.175%
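    Activation density is usually read as the share of sampled tokens on which the feature fires. A minimal sketch, assuming "fires" means an activation strictly above zero; the `threshold` parameter and the toy sample are hypothetical and do not come from this page. At 0.175%, the feature would fire on roughly 175 of every 100,000 tokens.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation is above `threshold`.

    Assumption: a feature "fires" on a token when its activation is strictly
    greater than the threshold (0.0 by default).
    """
    if not activations:
        return 0.0  # no tokens sampled, so nothing can fire
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


# Toy sample: the feature fires on 175 of 100,000 tokens.
sample = [0.0] * 99_825 + [0.3] * 175
density = activation_density(sample)
print(f"{density:.3%}")  # prints 0.175%
```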

    No Known Activations