INDEX
Explanations
instances of reporting or learning experiences and lessons
Negative Logits
Token    Logit
avage    -0.16
Baghd    -0.14
igi      -0.14
athom    -0.14
ấu       -0.14
mav      -0.14
atus     -0.14
lav      -0.13
åĨ       -0.13
Bols     -0.13
Positive Logits
Token        Logit
discover     0.49
learn        0.48
discovered   0.46
discovery    0.44
discovers    0.44
learns       0.43
Learn        0.42
learn        0.42
Discover     0.42
learned      0.41
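
The page does not say how these logit lists are produced. A common approach, assumed here, is to project the feature's output direction onto each token's unembedding vector and then rank the tokens by the resulting score. The sketch below illustrates that idea on toy data; `logit_effects`, `top_and_bottom`, the direction vector, and the weight rows are all hypothetical names and values, not taken from the tool.

```python
# Minimal sketch (assumption): score each token by the dot product of a
# feature direction with that token's unembedding vector, then rank.

def logit_effects(direction, unembedding, vocab):
    """Return (token, score) pairs; unembedding holds one row per token."""
    effects = []
    for token, row in zip(vocab, unembedding):
        score = sum(d * w for d, w in zip(direction, row))
        effects.append((token, score))
    return effects


def top_and_bottom(effects, k):
    """Return the k most positive and the k most negative (token, score) pairs."""
    ranked = sorted(effects, key=lambda pair: pair[1], reverse=True)
    most_positive = ranked[:k]
    most_negative = ranked[-k:][::-1]  # most negative first
    return most_positive, most_negative


# Toy values chosen for illustration only.
vocab = ["learn", "discover", "igi", "mav"]
unembedding = [[0.5, 0.1], [0.4, -0.2], [-0.1, 0.3], [-0.2, 0.0]]
direction = [1.0, 0.0]

positive, negative = top_and_bottom(
    logit_effects(direction, unembedding, vocab), k=2
)
positive_tokens = [token for token, _ in positive]
negative_tokens = [token for token, _ in negative]
```

With these toy values, `positive_tokens` lists `learn` before `discover`, and `negative_tokens` lists `mav` before `igi`, mirroring the ordering used in the two tables above.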
Activations Density 0.162%
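
The density figure is not defined on the page. Assuming it means the fraction of sampled tokens on which the feature fires (activation above zero), it could be computed roughly as follows. `activation_density` and the sample values are hypothetical.

```python
# Minimal sketch (assumption): density = share of tokens whose activation
# exceeds a threshold, with zero as the default threshold.

def activation_density(activations, threshold=0.0):
    """Return the fraction of activations strictly greater than threshold."""
    if not activations:
        raise ValueError("activations must not be empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Toy sample: 1000 tokens, of which 2 activate the feature.
sample = [0.0] * 998 + [1.5, 0.3]
density = activation_density(sample)
density_percent = density * 100
```

Under this reading, a density of 0.162% would mean the feature activates on roughly 1.6 of every 1,000 sampled tokens.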