INDEX
    Explanations

    instances of colons indicating explanations or lists

    phrases related to reasoning and justification

    Negative Logits
    ascus            -0.76
    thur             -0.74
    apsed            -0.72
    agraph           -0.67
    rez              -0.67
    é¾               -0.66
    Fuck             -0.66
    emort            -0.66
    ocide            -0.66
    cember           -0.65

    Positive Logits
     reducing        1.00
     it              0.97
     reduces         0.95
     facilitating    0.92
     lowering        0.89
     lowers          0.88
     increased       0.88
     increases       0.86
     they            0.86
     eliminating     0.86

    Act Density 0.337%

    No Known Activations