INDEX
    Explanations

    words related to control, regulation, or enforcement

    Negative Logits
    lm      -0.20
    lp      -0.19
    latex   -0.19
    l       -0.19
    egr     -0.18
    lx      -0.18
    lv      -0.18
    ri      -0.17
    ls      -0.17
    ono     -0.17
    Positive Logits
    ̧       0.27
    raft    0.23
     heck   0.21
    chio    0.21
    chia    0.21
    s       0.20
    ourt    0.19
    eneg    0.19
    ussion  0.19
    avity   0.19
    Activation Density: 0.216%

    No Known Activations