Explanations
frequent occurrences of the word "the."
Negative Logits
spørs       -0.65
overras     -0.64
spørsmål    -0.61
Anſ         -0.60
Wikiseite   -0.60
anggung     -0.60
iſen        -0.59
juſ         -0.59
Reſ         -0.59
raiſ        -0.59
Positive Logits
of          0.55
OF          0.55
ofthe       0.52
của         0.51
Filip       0.49
OfThe       0.49
Acht        0.47
ⓧ           0.46
Of          0.45
OfClass     0.45
Activations Density: 0.428%