INDEX
    Explanations

    sentences that express personal experiences and reflections

    Negative Logits
    )\}$          -0.78
    ("")]         -0.77
    '],           -0.77
    '),           -0.76
    "),           -0.74
    ;";           -0.70
    />";          -0.69
    ]}"           -0.68
    )";           -0.67
    /";           -0.67
    Positive Logits
    because       0.70
     they         0.69
    They          0.67
     because      0.66
     he           0.66
     I            0.65
    本当は        0.61
    Because       0.59
     They         0.58
     originally   0.58
    Act Density 0.373%

    No Known Activations