    Explanations

    deliberately manipulative

    Negative Logits
    ர்ந்த
    0.50
     вашей
    0.50
    enste
    0.48
     народ
    0.48
     wd
    0.48
     నమో
    0.47
    0.47
     వంటి
    0.46
    骑士
    0.46
    <unused635>
    0.46
    Positive Logits
    These
    0.59
     these
    0.57
    টারে
    0.47
    these
    0.44
    這些
    0.44
     AIDS
    0.43
     These
    0.43
     protease
    0.43
     rac
    0.43
    ये
    0.42
    Activation Density: 0.001%

    No Known Activations