Explanations
references to individuals and their roles or contributions within a context
Negative Logits
g         -1.20
g         -1.10
g         -0.74
gens      -0.67
𝑔         -0.67
ging      -0.64
ged       -0.63
gen       -0.63
gating    -0.61
gha       -0.61
Positive Logits
Գ         1.02
Г         1.00
GG        0.99
G         0.98
Gu        0.94
GC        0.89
Gi        0.88
GV        0.88
GX        0.87
GF        0.87
Activations Density: 0.955%