Explanations
expressions related to value and worthiness
Negative Logits
Token      Logit
Values     -0.18
Values     -0.16
onta       -0.16
values     -0.15
values     -0.15
Whe        -0.14
elli       -0.14
-values    -0.14
du         -0.14
ras        -0.14
Positive Logits
Token        Logit
worth        0.88
Worth        0.77
worth        0.71
worthwhile   0.52
sworth       0.38
worthy       0.37
worthy       0.32
orth         0.31
ORTH         0.31
-worthy      0.30
Activations Density: 0.109%