INDEX
Explanations
determiner phrases, particularly ones starting with "the" or "a"
Negative Logits
itſelf      -1.17
Theſe       -1.17
་་          -1.16
Jefus       -1.13
―――――       -1.12
iſt         -1.11
myſelf      -1.11
ſind        -1.08
محفوظة      -1.07
)";         -1.06
Positive Logits
on                  1.34
On                  0.86
(token not shown)   0.86
On                  0.81
ON                  0.75
to                  0.74
in                  0.74
↵↵                  0.71
.                   0.71
at                  0.71
Activations Density 0.197%