INDEX
    Explanations

    references to academic papers or research articles

    mentions of research papers or academic publications

    Negative Logits
    Token        Logit
    "alez"       -0.87
    "akening"    -0.76
    "aren"       -0.64
    "endor"      -0.62
    "cffffcc"    -0.62
    "iak"        -0.60
    "ostic"      -0.59
    "rt"         -0.59
    "eal"        -0.59
    "rogens"     -0.58
    Positive Logits
    Token        Logit
    "Paper"      1.10
    "clip"       1.01
    " towels"    0.88
    " paper"     0.87
    " Paper"     0.87
    " papers"    0.84
    "flies"      0.78
    "papers"     0.76
    " towel"     0.76
    "books"      0.76
    Activation Density: 0.013%

    No Known Activations