    Explanations

    large language models, scale, training

    Negative Logits
     marital    0.47
    вий         0.45
     Herr       0.42
     زینت       0.41
     Wife       0.41
     seines     0.41
     दर्श       0.40
     Konto      0.40
     Zuschauer  0.39
     Sunset     0.39
    Positive Logits
    GPT         0.66
     enormes    0.64
     ogrom      0.63
     GPUs       0.62
     enorme     0.61
     énorme     0.60
     billions   0.59
    huge        0.59
    Huge        0.57
     training   0.56
    Activation Density: 0.874%

    No Known Activations