    Explanations

    references to authority figures or influential individuals

    Negative Logits
    cco    -0.16
    ouz    -0.15
    oard    -0.15
    atrice    -0.15
    ombok    -0.15
    arie    -0.14
    ÏĤ    -0.14
    fix    -0.14
    allah    -0.14
     Brilliant    -0.14
    Positive Logits
     sturdy    0.20
     functionalities    0.18
     men    0.18
    paced    0.16
     invalid    0.16
     Scotch    0.15
     turbulent    0.15
     honest    0.15
     gentle    0.15
     ye    0.15
    Activation Density: 0.395%

    No Known Activations