Explanations
Titles of books and their details
Negative Logits
very            0.26
closing         0.26
likes           0.25
Hän             0.25
la              0.25
funny           0.25
lá              0.25
rainbows        0.24
waves           0.24
distractions    0.23
Positive Logits
<unused2037>    0.30
여섯            0.29
原子炉          0.29
kitabı          0.29
পাকিস্তানের      0.28
<unused139>     0.28
изпол           0.28
<unused146>     0.28
<unused567>     0.28
<unused1837>    0.28
Activations Density 0.007%