INDEX
    Explanations

    phrases related to comparisons or distinctions between different categories or concepts

    conjunctions and words indicating contrast or alternatives

    Negative Logits
    Token            Logit
    " Carbuncle"     -0.76
    "ngth"           -0.75
    " Tears"         -0.75
    "ttes"           -0.74
    "amines"         -0.74
    " chairs"        -0.72
    "stanbul"        -0.70
    "ulia"           -0.69
    "oldemort"       -0.68
    " Hew"           -0.66
    Positive Logits
    Token              Logit
    " otherwise"       1.12
    " nons"            1.03
    " unex"            0.99
    " non"             0.95
    " uns"             0.93
    " passive"         0.92
    " unprotected"     0.91
    " nont"            0.90
    " unin"            0.89
    " conventional"    0.88
    Activation Density: 0.120%

    No Known Activations