Explanations
increment operations in code
Negative Logits
eh      -0.71
</i>    -0.70
pe      -0.67
va      -0.64
h       -0.63
os      -0.63
n       -0.63
di      -0.62
ina     -0.61
be      -0.61
Positive Logits
++;     2.08
++;     1.93
]++;    1.62
++);    1.44
)++;    1.41
++];    1.41
++;     1.41
++      1.39
++,     1.39
--;     1.30
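The Negative Logits and Positive Logits lists are the bottom-k and top-k entries of a per-token score. The minimal sketch below shows that ranking step. It assumes the score for every vocabulary token has already been computed (dashboards of this kind commonly take the dot product of the feature's decoder direction with each token's unembedding vector, but that is an assumption here, not something this page states). The function name `top_and_bottom_logits` is hypothetical, and the input uses a handful of values copied from the lists above.

```python
from __future__ import annotations


def top_and_bottom_logits(
    effects: dict[str, float], k: int
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return the k most-promoted and k most-suppressed tokens.

    `effects` maps each token to the feature's (assumed precomputed)
    effect on that token's logit.
    """
    ranked = sorted(effects.items(), key=lambda kv: kv[1])  # ascending by score
    negative = ranked[:k]        # most negative scores first
    positive = ranked[::-1][:k]  # most positive scores first
    return positive, negative


# A few entries taken from the lists on this page.
effects = {"++;": 2.08, "]++;": 1.62, "++);": 1.44,
           "eh": -0.71, "</i>": -0.70, "pe": -0.67}
positive, negative = top_and_bottom_logits(effects, k=2)
print(positive)  # [('++;', 2.08), (']++;', 1.62)]
print(negative)  # [('eh', -0.71), ('</i>', -0.7)]
```

A dict cannot hold the same key twice, so tokens that render identically (the several `++;` rows, which likely differ only in whitespace) would need distinct keys or a list of pairs in a real vocabulary.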
Activation Density: 0.032%
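A density of 0.032% means the feature fires on roughly 1 token in every 3,125 (100 / 0.032 = 3125). The sketch below computes density as a percentage, assuming it is defined as the share of token positions whose activation is strictly above zero. The function name `activation_density` and its `threshold` parameter are hypothetical, and the activation list is toy data, not this feature's real activations.

```python
def activation_density(activations: list[float], threshold: float = 0.0) -> float:
    """Percentage of token positions whose activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Toy data: 100,000 positions, one active position every 3,125 tokens (32 total).
acts = [0.0] * 100_000
for i in range(0, 100_000, 3125):
    acts[i] = 1.0
print(activation_density(acts))  # 0.032
```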