    Explanations

    words related to propaganda and opinion-sharing

    Negative Logits
    reat      -0.68
    ritic     -0.66
    yip       -0.64
    knife     -0.62
    pperc     -0.62
    atro      -0.62
    atches    -0.62
    thren     -0.60
    FH        -0.60
    rients    -0.58
    Positive Logits
    ocating   1.03
    uding     1.02
    usion     0.95
    uring     0.94
     sorts    0.90
    ocated    0.87
     kinds    0.81
    ocation   0.81
    usions    0.80
    owing     0.79
    Activation Density: 0.045%

    No Known Activations