Explanations
references to foundational principles or frameworks based on empirical data
Negative Logits
Token       Logit
iba         -0.16
quoise      -0.15
µľ          -0.15
okino       -0.14
ADE         -0.14
/Dk         -0.14
ɵ           -0.14
orelease    -0.14
enci        -0.14
qua         -0.13
Positive Logits
Token           Logit
experience      0.35
experiences     0.31
principles      0.29
observations    0.29
observation     0.28
experience      0.27
principle       0.27
feedback        0.26
input           0.25
observations    0.24
Activations Density 0.307%