INDEX
Explanations
neutrals or terms without significant activation signals
Code, URLs, or file paths
Chinese judges
Negative Logits
issante    -0.46
<eos>      -0.44
“          -0.43
’          -0.42
doctype    -0.42
↵↵         -0.42
(blank)    -0.41
No         -0.41
(blank)    -0.41
dasar      -0.41
Positive Logits
uxxxx      0.90
houſe      0.73
AppColors  0.72
ſche       0.69
الحره      0.69
ſind       0.67
Datuak     0.67
ſelves     0.66
faſt       0.66
Mémoires   0.66
Activations Density 0.012%