INDEX
    Explanations

    mentions of human values

    references to core values and principles

    NEGATIVE LOGITS
    "女"            -0.72
    "igans"         -0.69
    "geon"          -0.69
    "jen"           -0.68
    "sie"           -0.67
    "ready"         -0.67
    "nih"           -0.66
    "\/\/"          -0.66
    "DERR"          -0.65
    "fac"           -0.64

    POSITIVE LOGITS
    "iblings"        0.80
    " ideals"        0.76
    " values"        0.72
    " tolerance"     0.71
    " principles"    0.70
    " Values"        0.69
    "cape"           0.69
    " beliefs"       0.67
    " embodied"      0.67
    " Advocate"      0.66

    Act Density 0.027%

    No Known Activations