INDEX
    Explanations

    affirmative phrases and statements related to self-awareness and acknowledgment

    Negative Logits
    "];            -0.70
    !")            -0.68
     متعلقه        -0.68
    "):            -0.67
    '];            -0.64
    //             -0.64
    "]).           -0.63
    ")));          -0.63
     &___          -0.63
    ()]            -0.63
    Positive Logits
     disagree      0.59
     apples        0.51
    Distribuzione  0.51
     disprove      0.49
     oike          0.48
     facts         0.47
     argument      0.47
     rebuttal      0.47
    impianto       0.47
    事實           0.47
    Act Density 0.448%

    No Known Activations