INDEX
    Explanations

    phrases contrasting differences

    Negative Logits
    Token       Logit
    "ATA"       -0.88
    "vez"       -0.82
    "rollers"   -0.77
    "mberg"     -0.72
    "rive"      -0.71
    "WI"        -0.62
    "kamp"      -0.62
    "ale"       -0.60
    "ITED"      -0.60
    " staples"  -0.59
    Positive Logits
    Token       Logit
    " between"  1.24
    "between"   1.12
    " Between"  1.07
    "iveness"   1.02
    "iator"     0.98
    "iating"    0.96
    "ials"      0.92
    "erence"    0.91
    "yip"       0.84
    " maker"    0.84
    Activation Density: 0.040%

    No Known Activations