    Explanations

    comments or annotations in code documentation

    Negative Logits
    Logit   Token
    -0.81   "OGND"
    -0.81   " zoude"
    -0.78   "adeloupe"
    -0.77   "genodigd"
    -0.76   "mpagne"
    -0.76   "UserScript"
    -0.75   " stiefe"
    -0.74   " laſſen"
    -0.73   " ब्रेकडाउन"
    -0.72   " للمعارف"
    Positive Logits
    Logit   Token
    0.91    "//"
    0.91    "///"
    0.86    " *"
    0.81    "*"
    0.64    "//"
    0.64    "[toxicity=0]"
    0.63    " //"
    0.56    "#"
    0.55
    0.55    "+//"
    Act Density 0.092%

    No Known Activations