INDEX
    Explanations

    references to liberalism and its various forms and implications

    Negative Logits
    Token       Logit
    ersen       -0.15
    ey          -0.15
    plier       -0.15
    uld         -0.15
    ese         -0.15
    ej          -0.15
    iq          -0.14
    alian       -0.14
    iles        -0.14
    oth         -0.14
    Positive Logits
    Token       Logit
    ised        0.18
    ornings     0.16
    /lib        0.15
    onec        0.15
    hift        0.15
    ATA         0.14
    ising       0.14
    uyến        0.14
    -leaning    0.14
    entiful     0.14
    Activation Density: 0.008%

    No Known Activations