INDEX
    Explanations

    terms related to personal opinions and influences

    Negative Logits
     pleaſure         -0.89
    ſelves            -0.84
     propOrder        -0.82
     kasarigan        -0.79
     houſe            -0.78
     ſei              -0.78
     Houſe            -0.77
    Personendaten     -0.77
     ſind             -0.76
     ſou              -0.75
    Positive Logits
    [                 0.30
    top               0.30
     myself           0.29
    我把              0.28
    .                 0.28
     xấu              0.27
    "[                0.27
    '                 0.26
    []                0.25
     saw              0.25
    Activation Density: 0.323%

    No Known Activations