INDEX
    Explanations

    harmful or exploitative content

    Negative Logits
    Token           Value
    (               0.79
    <h1>            0.75
    Table           0.72
    Draw            0.70
    [blank]         0.70
    [blank]         0.69
    See             0.67
    ---             0.67
    View            0.65
    #               0.64
    Positive Logits
    Token           Value
     LEC            0.88
     Oversight      0.87
     incapacity     0.86
    <unused1888>    0.86
    <unused368>     0.85
     مذہبی          0.84
     russe          0.83
    <unused1044>    0.83
    <unused2145>    0.83
     hating         0.83
    Activation Density: 0.350%

    No Known Activations