    Explanations

    preferences or popular choices

    instances of the word "favorite" and its variations

    Negative Logits
    urers      -0.79
    inas       -0.77
    ural       -0.76
    idental    -0.75
    OUT        -0.74
    heed       -0.74
    abeth      -0.74
    okin       -0.72
    absor      -0.71
    roup       -0.71
    Positive Logits
     favorites   0.92
     haunt       0.89
     favorite    0.84
     darling     0.83
     underdog    0.76
     haun        0.74
     amongst     0.73
    é¾įå¥ij士    0.70
    é¾įå        0.70
     whipping    0.69
    Activation Density: 0.016%

    No Known Activations