INDEX
    Explanations

    urban environments and living

    Negative Logits
    Token                 Value
    "ي"                   1.22
    (token not rendered)  0.94
    "ি"                   0.85
    "hindi"               0.84
    "ați"                 0.83
    "𝚞"                   0.83
    "i"                   0.82
    (whitespace token)    0.80
    " sidan"              0.80
    " пищи"               0.80
    Positive Logits
    Token                 Value
    " dwellers"           1.13
    "หลวง"                0.98
    "ри"                  0.90
    "werke"               0.89
    (token not rendered)  0.88
    "ia"                  0.87
    "рија"                0.86
    " slur"               0.85
    "п"                   0.84
    " dw"                 0.83
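    Positive and negative logit lists like the ones above are commonly produced by projecting a feature's decoder direction onto the model's unembedding vectors and keeping the highest- and lowest-scoring tokens. The page does not say how its values were computed, so the sketch below assumes that method, ignores any final LayerNorm, and uses a hypothetical helper `top_logit_effects` with toy vectors. It returns signed scores; the listing above shows its values without a sign.

```python
from typing import Dict, List, Sequence, Tuple


def top_logit_effects(
    decoder_dir: Sequence[float],
    unembed: Dict[str, Sequence[float]],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return (top_positive, top_negative) tokens by direct logit effect.

    Hypothetical helper: assumes the score for a token is the dot product of
    the feature's decoder direction with that token's unembedding vector.
    """
    # Direct effect of the feature direction on each token's logit.
    scores = {
        tok: sum(d * u for d, u in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }
    # Sort ascending by score: most negative tokens come first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1])
    top_negative = ranked[:k]        # lowest scores, most negative first
    top_positive = ranked[::-1][:k]  # highest scores, most positive first
    return top_positive, top_negative


if __name__ == "__main__":
    # Toy 2-dimensional example with a three-token vocabulary.
    pos, neg = top_logit_effects(
        [1.0, 0.0],
        {"a": [2.0, 0.0], "b": [-1.0, 5.0], "c": [0.5, 1.0]},
        k=1,
    )
    print(pos, neg)
```

    With a real model, `decoder_dir` would be one row of the SAE decoder and `unembed` the columns of the unembedding matrix, indexed by token string.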
    Activation Density 0.056%
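    Activation density is usually read as the share of tokens in a sample on which the feature fires. Assuming that definition and a firing threshold of zero (the page states neither its sample nor its threshold), a minimal sketch with a hypothetical helper `activation_density`:

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation is strictly above `threshold`.

    Hypothetical helper: the zero threshold is an assumption, not a value
    taken from the page.
    """
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    # Count tokens on which the feature fires.
    firing = sum(1 for a in activations if a > threshold)
    return firing / len(activations)


if __name__ == "__main__":
    # 56 firing tokens out of 100,000.
    sample = [0.0] * 99_944 + [1.0] * 56
    print(f"{activation_density(sample):.3%}")
```

    For example, 56 nonzero activations among 100,000 tokens gives 0.00056, i.e. 0.056%.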

    No Known Activations