    Explanations

    mentions of personal preferences or favorites

    instances of the word "favorite" and its variations

    Negative Logits
    Token        Logit
    "aping"      -0.86
    "ural"       -0.85
    "heed"       -0.84
    "attle"      -0.79
    "aton"       -0.79
    "atan"       -0.78
    "urers"      -0.77
    "arin"       -0.74
    "athered"    -0.74
    "sten"       -0.74
    Positive Logits
    Token          Logit
    "Favorite"     1.02
    " favorite"    0.87
    " pokemon"     0.79
    " whipping"    0.76
    " Favorite"    0.75
    " moments"     0.73
    " darling"     0.73
    " hobbies"     0.72
    " favorites"   0.72
    " sibling"     0.71
    Activation Density: 0.021%

    No Known Activations