    Negative Logits
     object       0.50
                  0.50
     if           0.48
     the          0.46
     something    0.45
     bottom       0.45
     apparent     0.44
     beginnings   0.44
     more         0.43
     unapolog     0.42
    Positive Logits
    等人        0.47
    heng        0.45
    ȩ           0.44
    ǧ           0.42
    oitte       0.42
    ਦਰ          0.41
    arrerol     0.41
    ouard       0.41
    cheng       0.40
    arxiv       0.39
    Act Density 0.011%

    No Known Activations