INDEX
Explanations
structured formats or patterns in text, particularly in questions and answer options
Negative Logits
iac     -0.15
iar     -0.14
ores    -0.14
ium     -0.14
iam     -0.13
ã       -0.13
Premi   -0.13
cow     -0.13
éŁ¿     -0.13
oba     -0.13
Positive Logits
ecz     0.16
-none   0.15
markup  0.14
olini   0.14
ожд     0.14
گراÙĨ   0.14
#ga     0.13
ertino  0.13
GRAM    0.13
cri     0.13
Activations Density 0.005%