INDEX
    Explanations

    phrases or words related to public or political speeches

    references to public statements or comments made by individuals

    Negative Logits
    ccording    -0.78
    otype       -0.69
    ntil        -0.64
     Orange     -0.62
    ramid       -0.61
     Rescue     -0.60
    rome        -0.60
    rafted      -0.60
    duct        -0.57
    versely     -0.57

    Positive Logits
     remarks    1.10
     comments   0.87
    æĥ          0.80
     aloud      0.79
    ä¹ĭ         0.77
     slurs      0.77
     goodbye    0.76
     dispar     0.75
     uttered    0.74
    ault        0.74

    Activation Density: 0.021%

    No Known Activations