Explanations
mentions of reading or books
Negative Logits
xon          -0.67
ño           -0.64
Enlarge      -0.62
lla          -0.61
ortality     -0.60
ascal        -0.59
ctions       -0.59
cker         -0.59
trl          -0.58
ality        -0.58
Positive Logits
aloud            1.49
just             1.15
comprehension    1.09
dress            0.98
excerpts         0.96
books            0.94
write            0.89
texts            0.88
mitt             0.88
mill             0.86
Activations Density 0.040%