INDEX
    Explanations

    words related to trying out new things and exploring different possibilities

    phrases related to experimentation and testing processes

    Negative Logits
    Token            Logit
    "fixed"          -0.67
    "cut"            -0.64
    "utra"           -0.63
    "IDS"            -0.62
    "DoS"            -0.62
    "CHAPTER"        -0.62
    "say"            -0.61
    "posted"         -0.61
    "article"        -0.60
    "paragraph"      -0.60
    Positive Logits
    Token                 Logit
    " experimenting"      1.00
    " withd"              0.90
    "ively"               0.88
    " experimented"       0.88
    " experimentation"    0.85
    " experiment"         0.82
    "imental"             0.80
    "iments"              0.79
    "ally"                0.79
    " Experiment"         0.76
    Activation Density: 0.018%

    No Known Activations