INDEX
    Explanations

    helpful AI assistant refusing requests

    Negative Logits
    Token               Value
    " निर्णायक"          0.35
    " layoffs"          0.34
    "毎年"              0.33
    "brak"              0.32
    " governing"        0.32
    "Closest"           0.32
    "props"             0.32
    " gemeente"         0.32
    " classed"          0.31
    "Authorization"     0.31
    Positive Logits
    Token               Value
    " поэтому"          0.42
    " Поэтому"          0.42
    " což"              0.40
    "Therefore"         0.35
    "нома"              0.35
    " derfor"           0.35
    " Therefore"        0.35
    "므로"              0.34
    "だったので"        0.34
    " harmless"         0.34
    Activation Density: 0.272%

    No Known Activations