INDEX
Explanations
references to novels and fictional works
Negative Logits
[             -0.55
y             -0.51
Dance         -0.51
لاثة          -0.50
disciplinary  -0.50
in            -0.50
Ehren         -0.49
-             -0.49
ش             -0.47
.             -0.47
Positive Logits
novel         3.63
Novel         3.38
novel         3.28
Novel         3.17
NOVEL         3.02
novels        2.51
Novels        2.27
novelist      1.80
novelists     1.74
novelty       1.64
Activations Density 0.069%