INDEX
    Explanations

    primary intention or process

    assistant/model-response segments within a chat transcript (i.e., content from the model’s turn rather than the user’s).

    Negative Logits
     Neighbourhood  0.41
     thomas  0.40
     নিঃসন্দেহে  0.40
     pozycji  0.39
     THOMAS  0.39
     obstáculos  0.39
     Loksatta  0.39
     alltid  0.38
    [token missing]  0.38
     berlin  0.37
    Positive Logits
     Highlander  0.49
    linien  0.47
     معين  0.46
    પતિ  0.46
    itul  0.44
    нюю  0.43
    ii  0.42
    idas  0.42
    يك  0.41
    उच्च  0.41
    Activation Density: 15.583%

    No Known Activations