INDEX
    Explanations

    references to personal favorites or preferences

    Negative Logits
    Token (quoted)    Logit
    "er"              -0.84
    " l"              -0.70
    " I"              -0.69
    "</em>"           -0.67
    " In"             -0.67
    "ers"             -0.66
    " r"              -0.66
    " was"            -0.65
    "man"             -0.64
    " n"              -0.64
    Positive Logits
    Token (quoted)    Logit
    " favorites"      1.61
    " Favorites"      1.57
    " favorite"       1.56
    " Favorite"       1.54
    " favourite"      1.52
    " favourites"     1.52
    " Favourite"      1.51
    "favourite"       1.46
    " FAVORITE"       1.44
    "favorite"        1.44
    Activation Density: 0.042%

    No Known Activations