INDEX
    Explanations

    safety reasons for prohibitions

    NEGATIVE LOGITS
    Token           Value
    " tokens"       0.44
    "Tokens"        0.42
    " Boul"         0.41
    " Tomas"        0.40
    "トーク"        0.39
    "tokens"        0.39
    (not shown)     0.37
    " Token"        0.37
    "一条"          0.36
    "গুলোকে"        0.36
    POSITIVE LOGITS
    Token           Value
    " Aside"        0.43
    "理由"          0.42
    "amatsu"        0.40
    " threefold"    0.40
    " unido"        0.39
    " aside"        0.39
    " ngunit"       0.38
    "கிறது"         0.38
    " lakini"       0.38
    " INCLUDING"    0.37
    Activation Density: 0.089%

    No Known Activations