INDEX
Explanations
terms that describe size and magnitude
large followed by noun
Negative Logits
2            -0.34
1            -0.34
3            -0.30
explain      -0.29
perbuatan    -0.28
5            -0.28
based        -0.28
6            -0.28
elkaar       -0.28
jums         -0.28
Positive Logits
IBOutlet     0.78
surla        0.74
imagui       0.73
             0.72
zwiſchen     0.72
ſſung        0.72
<unused42>   0.71
<unused41>   0.71
<pad>        0.71
[@BOS@]      0.71
Activations Density 0.046%