    Explanations

    concepts related to goodness and morality

    Negative Logits
     strikingly         -0.49
    colari              -0.48
    IntoConstraints     -0.48
     closely            -0.47
     دقی                -0.46
     precisely          -0.46
    contentLoaded       -0.46
     comparatively      -0.45
    ctically            -0.45
    xrTableCell         -0.45
    Positive Logits
    Good                 0.78
     Good                0.73
    good                 0.65
    GOOD                 0.64
     GOOD                0.63
     good                0.56
     estekak             0.54
    Evil                 0.52
     ""],                0.51
    好人                 0.50
    Activation Density: 0.033%

    No Known Activations