INDEX
    Explanations

    specific terms or phrases related to toxicity and its effects

    Negative Logits
    Token             Logit
    "eſ"              -0.62
    "ftance"          -0.61
    "phim"            -0.61
    "ftant"           -0.60
    "énario"          -0.59
    "citenamefont"    -0.58
    "DockStyle"       -0.58
    "ftances"         -0.57
    "iffance"         -0.56
    "ebvre"           -0.56
    Positive Logits
    Token                 Logit
    (blank)               0.67
    "f"                   0.50
    " vecka"              0.48
    " investissements"    0.48
    "w"                   0.47
    "M"                   0.47
    "h"                   0.47
    "B"                   0.46
    " ricerche"           0.46
    "b"                   0.45
    Activation Density: 0.548%

    No Known Activations