    Negative Logits
        " pretended"      -0.77
        " preferring"     -0.71
        " Acting"         -0.70
        "IContainer"      -0.69
        " pretending"     -0.68
        "seems"           -0.68
        " refusing"       -0.65
        "Acting"          -0.65
        " seeming"        -0.65
        "ьаж"             -0.65
    Positive Logits
        " to"             0.92
        "ly"              0.85
        "LY"              0.71
        "]='\"            0.59
        "Werbung"         0.56
        "zunehmen"        0.53
        " une"            0.50
        "nesses"          0.50
        "expandindo"      0.50
        "schaft"          0.50
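The logit lists rank vocabulary tokens by how strongly this feature pushes their output logits down (negative) or up (positive). SAE dashboards commonly build such lists by taking the dot product of the feature's decoder direction with each token's unembedding vector and keeping the k most negative and k most positive results. That derivation is an assumption here; this page does not state it. Below is a minimal pure-Python sketch on a toy two-dimensional model. The helpers `logit_effects` and `top_logits` and all the toy vectors are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical helper (assumed method): the logit effect of one feature on each
# token is dot(decoder_direction, unembedding_vector_for_that_token).
def logit_effects(decoder: List[float], unembed: Dict[str, List[float]]) -> Dict[str, float]:
    return {tok: sum(d * u for d, u in zip(decoder, vec)) for tok, vec in unembed.items()}

# Hypothetical helper: return the k largest effects ("positive logits") and the
# k smallest ("negative logits"), each list ordered from most extreme inward.
def top_logits(
    effects: Dict[str, float], k: int
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by effect
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return positive, negative

# Toy 2-d model. The values are made up for illustration and are not this feature's weights.
decoder = [1.0, -0.5]
unembed = {
    " to": [0.8, -0.2],
    "ly": [0.6, 0.0],
    " pretended": [-0.5, 0.5],
    "seems": [-0.3, 0.4],
}

positive, negative = top_logits(logit_effects(decoder, unembed), k=2)
for tok, val in positive:
    print(f"+ {tok!r:14} {val:+.2f}")
for tok, val in negative:
    print(f"- {tok!r:14} {val:+.2f}")
```

On a real model the same ranking runs over the full vocabulary. In practice it is usually a single matrix-vector product followed by a top-k, not a Python loop.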
    Act Density 0.328%
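Dashboards usually define activation density as the share of sampled tokens on which the feature's activation is above zero. That definition is assumed here. Under it, 0.328% means the feature fired on roughly 3.3 of every 1,000 tokens. The sketch below computes this percentage. The function name `activation_density`, its `threshold` parameter and the sample activations are hypothetical.

```python
from typing import Sequence

def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds `threshold` (assumed 0)."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)

# Made-up activations: 3 of 1,000 tokens fire.
acts = [0.0] * 997 + [1.2, 0.4, 2.5]
print(f"{activation_density(acts):.3f}%")  # 0.300%
```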

    No Known Activations