INDEX
    Explanations

    restriction, limiting actions

    Negative Logits
    Token             Logit
    "wy"              -0.77
    " Majefty"        -0.71
    " Merit"          -0.68
    " merit"          -0.66
    " виправивши"     -0.63
    " Anſ"            -0.62
    " ſy"             -0.62
    " themſelves"     -0.61
    " iſt"            -0.60
    " Theſe"          -0.59

    Positive Logits
    Token             Logit
    "########."       0.71
    " lệ"             0.63
    "y"               0.61
    ":✨"              0.54
    "unk"             0.54
    "DoubleQuotes"    0.53
    " rzecz"          0.52
    "ers"             0.51
    "verwijspagina"   0.49
    "t"               0.49

    Activation Density: 0.552%

    No Known Activations