    Explanations

    defining concepts and language

    Negative Logits
    Token                       Value
    " mentality"                0.70
    "aholic"                    0.66
    " sparkling"                0.65
    " apathy"                   0.64
    " nagging"                  0.63
    " compulsory"               0.63
    (no token text in source)   0.62
    " unfiltered"               0.60
    (no token text in source)   0.60
    " complacency"              0.59
    Positive Logits
    Token                       Value
    "Serum"                     0.58
    "мо"                        0.58
    "Barcelona"                 0.57
    "উপ"                        0.56
    "itinéraire"                0.56
    "Macrophages"               0.55
    "ReLU"                      0.55
    (no token text in source)   0.54
    "Tür"                       0.53
    "Kab"                       0.53
    Activation Density: 0.496%

    No Known Activations