INDEX
    Explanations

    references to suffering and deprivation

    phrases related to personal experiences and identity

    Negative Logits
    hess        -0.67
     Tut        -0.60
     Klaus      -0.54
     Milton     -0.53
     Naples     -0.52
     Augusta    -0.51
     Augustus   -0.50
     Fib        -0.49
     Prix       -0.49
     Hammond    -0.49

    Positive Logits
    */(         0.73
    laughs      0.70
    awaru       0.68
    Laughs      0.65
    É           0.64
    EStream     0.63
    ¯           0.61
    ymes        0.61
    .?          0.61
    {\          0.59
    Act Density 2.301%

    No Known Activations