    Explanations

    refusal of inappropriate requests

    Negative Logits
    entar       0.47
    ukaemia     0.47
     '          0.45
     Entropy    0.42
     Euros      0.42
     Lines      0.41
     Ennis      0.41
     euros      0.41
     Opening    0.41
     euro       0.40
    Positive Logits
     độ         0.43
    தன்         0.41
    लन          0.41
    继承        0.41
    登山        0.41
     quân       0.40
     gehe       0.40
     ছোট        0.40
     الذين      0.40
    لس          0.40
    Activation Density: 0.004%

    No Known Activations