Explanations
large language models, scale, training
Negative Logits
marital    0.47
вий        0.45
Herr       0.42
زینت       0.41
Wife       0.41
seines     0.41
दर्श       0.40
Konto      0.40
Zuschauer  0.39
Sunset     0.39
Positive Logits
GPT        0.66
enormes    0.64
ogrom      0.63
GPUs       0.62
enorme     0.61
énorme     0.60
billions   0.59
huge       0.59
Huge       0.57
training   0.56
Activations Density 0.874%