INDEX
Explanations
adjectives that describe levels of difficulty or ease
Negative Logits
Token     Logit
esson     -0.18
Priv      -0.16
Priv      -0.16
artz      -0.15
uuid      -0.14
897       -0.14
zc        -0.14
enthal    -0.14
olars     -0.14
elong     -0.14
Positive Logits
Token     Logit
to        0.32
-to       0.27
_to       0.22
to        0.19
ToRemove  0.18
to        0.18
ToUpdate  0.17
-To       0.16
easy      0.16
ledo      0.16
Activations Density 0.041%
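
As a rough illustration of the density figure above, the sketch below assumes that "activations density" means the fraction of sampled token positions on which the feature fires (activation above zero). That reading is an assumption, not something this page states. The function name `activation_density` and the sample data are hypothetical and chosen only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" is the share of token positions where the
    feature is active, i.e. strictly greater than the threshold.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical sample: 100,000 token positions, 41 of them active.
# These numbers are made up; they are not the data behind this page.
sample = [0.0] * 99_959 + [1.0] * 41
density = activation_density(sample)
print(f"{density:.3%}")
```

Under this reading, a density well below 1% would indicate a sparse feature that is active on only a small share of tokens.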