    Explanations

    The name "Han" followed by any other token, with activation levels that vary by context.

    Mentions of the name "Han".

    Negative Logits
    Token           Logit
    " Thumbnails"   -0.68
    " destro"       -0.65
    "ODUCT"         -0.65
    "cipline"       -0.64
    "tsky"          -0.63
    " Colossus"     -0.63
    " IMAGES"       -0.62
    "utics"         -0.62
    "llan"          -0.61
    "Downloadha"    -0.61
    Positive Logits
    Token           Logit
    "auer"          1.12
    "ning"          1.04
    "nington"       0.99
    "lon"           0.99
    "wei"           0.98
    "uman"          0.98
    " Solo"         0.95
    "wal"           0.88
    "ako"           0.86
    "ergy"          0.86
    Activation density: 0.027%

    No Known Activations