    Explanations

    phrases related to arguing or debating

    Negative Logits
    Token               Logit
    " Cosponsors"       -0.87
    "til"               -0.77
    "marked"            -0.73
    "ned"               -0.72
    "DragonMagazine"    -0.71
    "falls"             -0.70
    "typ"               -0.70
    "iste"              -0.67
    "ledged"            -0.67
    "maker"             -0.63
    Positive Logits
    Token               Logit
    " preserving"       1.19
    " avoiding"         1.03
    " keeping"          0.99
    " protecting"       0.98
    " accuracy"         0.96
    " fairness"         0.96
    " sanity"           0.94
    " maintaining"      0.93
    " realism"          0.92
    " secrecy"          0.92
    Activation Density: 0.053%

    No Known Activations