    Explanations

    phrases expressing newness or innovation

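    Negative Logits
    Token (quoted; a leading space is part of the token)    Logit
    " col"          -0.18
    "lopen"         -0.14
    "edef"          -0.14
    "darwin"        -0.14
    "ader"          -0.14
    "inke"          -0.14
    "_activate"     -0.14
    " Suff"         -0.13
    " Cary"         -0.13
    "605"           -0.13

    Positive Logits
    Token (quoted; a leading space is part of the token)    Logit
    " feature"      0.19
    " novel"        0.18
    " unique"       0.17
    " nov"          0.17
    " novelty"      0.16
    "feature"       0.16
    " twist"        0.15
    "Unique"        0.15
    "hower"         0.15
    " Feature"      0.14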
    Activation density: 0.094%

    No Known Activations