INDEX
    Explanations
    NEGATIVE LOGITS
    " %("          -0.08
    "verb"         -0.08
    "ilmiştir"     -0.07
    "stanbul"      -0.07
    " başk"        -0.07
    " restroom"    -0.06
    " است"         -0.06
    "ingt"         -0.06
    "etadata"      -0.06
    "security"     -0.06
    POSITIVE LOGITS
    " Alan"        0.17
    "Alan"         0.14
    " alan"        0.13
    "-An"          0.08
    " A"           0.07
    " Ian"         0.07
    ".A"           0.07
    " aba"         0.07
    " mun"         0.06
    ".a"           0.06
    Act Density 0.002%

    No Known Activations