INDEX
    Explanations

    declining inappropriate content requests

    Negative Logits
    кут      0.32
     چین     0.31
     Ці      0.31
             0.30
    Χ        0.29
    𝐯        0.29
             0.29
    пі       0.29
    其中     0.29
     Χ       0.28
    Positive Logits
    ...</    0.61
    ..."     0.59
    …”       0.59
     ..."    0.58
    …"       0.55
     …”      0.54
    -...     0.54
    …</      0.53
    ……”      0.53
    ...">    0.53
    Act Density 0.020%

    No Known Activations