INDEX
    Explanations
    Negative Logits
        Token (quoted)    Logit
        "HStack"          -0.93
        " being"          -0.88
        "اطر"             -0.86
        "することで"      -0.84
        " the"            -0.84
        " whose"          -0.82
        " America"        -0.82
        " which"          -0.82
        "رده"             -0.82
        " Dalam"          -0.81
    Positive Logits
        Token (quoted)    Logit
        " whenever"       1.34
        " and"            1.22
        "whenever"        1.05
        "holds"           1.02
        " holds"          0.98
        "ldorf"           0.91
        "Whenever"        0.91
        "рым"             0.86
        "idenav"          0.81
        "ston"            0.79
    Activation Density: 0.046%

    No Known Activations