    Explanations

    lead to negative consequences

    Negative Logits
     щоб          0.82
    voor          0.77
    力和          0.74
     mandato      0.72
    upon          0.69
    avaa          0.67
     pentru       0.66
    ':[           0.66
     gegen        0.65
    для           0.65

    Positive Logits
     nowhere      0.90
     anywhere     0.77
     astray       0.73
    డ్డు          0.72
                  0.70
     productive   0.69
     reproducing  0.68
     المخت        0.68
     कहीं         0.68
     stagnation   0.68
    Activation Density: 0.073%

    No Known Activations