Explanations
programming-related syntactical structures or symbols
Negative Logits
Č        -0.22
471      -0.15
NgÃłnh   -0.15
Pods     -0.15
``       -0.14
844      -0.14
andez    -0.14
utsch    -0.14
ewe      -0.14
ienda    -0.13
Positive Logits
↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵↵   0.27
↵↵↵↵↵↵↵   0.26
↵↵↵↵↵   0.25
↵↵↵↵↵↵↵↵   0.24
↵↵↵↵   0.23
↵↵↵↵↵↵↵↵↵↵   0.22
↵↵↵↵↵↵↵↵↵   0.22
↵↵↵↵↵↵   0.21
↵↵↵↵↵↵↵↵↵↵↵   0.20
↵↵↵↵↵↵↵↵↵↵↵↵   0.20
Activations Density 0.118%