    Explanations

    phrases that denote precision or specificity in statements

    Negative Logits
    Token           Logit
    " Flavoring"    -0.70
    "oké"           -0.70
    "oscopic"       -0.69
    "ngth"          -0.66
    "rift"          -0.66
    "cffff"         -0.64
    "rug"           -0.64
    "ocene"         -0.64
    " sacrific"     -0.63
    "itiz"          -0.62
    Positive Logits
    Token           Logit
    "ãĤ¨"           0.78
    " opposite"     0.76
    " wrong"        0.68
    " why"          0.68
    " analogous"    0.65
    "minus"         0.64
    " correct"      0.64
    "µ"             0.61
    " Horowitz"     0.61
    "¯"             0.61
    Activation Density: 0.006%

    No Known Activations