INDEX
    Explanations

    words related to taking actions, particularly in a regulatory or enforcement context

    NEGATIVE LOGITS
    Token       Logit
    "oran"      -0.16
    "šov"       -0.16
    "972"       -0.14
    "vard"      -0.14
    "orning"    -0.14
    "603"       -0.14
    "617"       -0.14
    "605"       -0.14
    " kalk"     -0.14
    "strap"     -0.14
    POSITIVE LOGITS
    Token             Logit
    " steps"          0.33
    " concrete"       0.24
    " Steps"          0.24
    " firm"           0.23
    " measures"       0.23
    "steps"           0.23
    "Steps"           0.22
    " appropriate"    0.22
    " necessary"      0.21
    " fir"            0.21
    Activation Density: 0.036%

    No Known Activations