Explanations
titles and subsequent words
Negative Logits
[token not shown]   0.61
<start_of_image>    0.54
using               0.45
..                  0.44
(                   0.44
these               0.43
member              0.42
[token not shown]   0.42
depending           0.41
here                0.41
Positive Logits
<unused1954>        0.76
<unused162>         0.76
<unused1834>        0.76
<unused1153>        0.76
<unused1992>        0.74
<unused1678>        0.73
<unused305>         0.73
<unused1845>        0.73
<unused565>         0.72
<unused1101>        0.71
Activations Density 0.029%