INDEX
Explanations
repeated patterns or references to "loop" in various contexts
Negative Logits
ibs     -0.18
ebo     -0.18
quez    -0.16
onne    -0.15
lee     -0.15
onse    -0.15
.uf     -0.15
wig     -0.14
Cru     -0.14
../     -0.14
Positive Logits
-loop   0.27
(loop   0.21
loop    0.21
Loop    0.20
Loop    0.19
loop    0.19
LOOP    0.18
loops   0.17
.loop   0.17
otron   0.16
Activations Density 0.016%