INDEX
    Explanations

    words related to incorrectness or mistakes

    phrases that indicate incorrectness or undesirable outcomes

    Negative Logits
    doms        -0.80
    dom         -0.74
    thood       -0.71
    lishes      -0.71
    anism       -0.71
    zeb         -0.68
    punk        -0.67
    renheit     -0.67
    archives    -0.67
    rs          -0.66

    Positive Logits
     amount       0.97
     thing        0.96
     balance      0.87
     combination  0.86
     solution     0.85
     kind         0.84
     antidote     0.82
     way          0.80
     piece        0.80
     side         0.79
    Act Density 0.039%

    No Known Activations