INDEX
    Explanations

    references to the word "nice" in various contexts

    Negative Logits
    Token      Logit
    "reach"    -0.19
    "h"        -0.17
    "OM"       -0.17
    "nd"       -0.16
    "l"        -0.15
    "lu"       -0.15
    " "        -0.15
    "atik"     -0.15
    "sz"       -0.15
    " ("       -0.14
    Positive Logits
    Token           Logit
    "olson"         0.20
    "-looking"      0.19
    "ptune"         0.18
    " surprises"    0.17
    "olas"          0.17
    "eties"         0.17
    " surpr"        0.16
    "agra"          0.16
    "енÑĮ"          0.16
    "contri"        0.15
    Activation Density: 0.022%

    No Known Activations