INDEX
    Explanations

    terms related to societal structures, power dynamics, and social values

    Negative Logits
                 -0.66
    [])\n        -0.59
    ******/      -0.54
    []){         -0.53
    "]));        -0.53
    ()])         -0.53
    '][]         -0.52
    [])          -0.51
    []\n         -0.51
    ,:),         -0.50
    Positive Logits
     always      1.00
    always       0.90
     siempre     0.85
     alone       0.82
     needn       0.81
     often       0.79
     usually     0.79
     всегда      0.76
     itself      0.75
    と聞         0.75
    Activation Density: 0.737%

    No Known Activations