INDEX
    Explanations
    Negative Logits
        Token                 Logit
        " bearer"             -0.81
        "pired"               -0.69
        "itute"               -0.68
        "VALUE"               -0.68
        "lishes"              -0.67
        "cffff"               -0.67
        "~~~~~~~~~~~~~~~~"    -0.67
        "enture"              -0.67
        "nesses"              -0.65
        "teen"                -0.65
    Positive Logits
        Token                 Logit
        "son"                  1.17
        " Gorsuch"             1.12
        " Strauss"             0.87
        " Armstrong"           0.87
        " Tyson"               0.83
        "ard"                  0.80
        " Patel"               0.79
        "aiman"                0.79
        "ordan"                0.78
        "sson"                 0.77
    Activation Density: 0.076%

    No Known Activations