INDEX
Explanations
the occurrence of specific named entities or proper nouns
Negative Logits
.~(\            -0.56
Xunit           -0.55
ictured         -0.55
rovna           -0.54
belong          -0.54
않습니다        -0.53
belongs         -0.52
kowitz          -0.52
ౖ               -0.51
zustellen       -0.51
Positive Logits
Italijani       0.68
članak          0.68
ftagPool        0.67
AutoModerator   0.65
contentLoaded   0.61
CloseOperation  0.61
дописавши       0.60
lenker          0.59
دیکھیے          0.59
ConstraintMaker 0.58
Activations Density 0.191%