INDEX
Explanations
references to written content or authorship
instances of the word "writes."
Negative Logits
RIS         -0.64
gest        -0.61
cept        -0.61
rium        -0.60
erest       -0.60
ground      -0.58
trailer     -0.56
season      -0.56
halftime    -0.55
frac        -0.55
Positive Logits
writes      3.58
wrote       2.29
write       1.91
reads       1.82
writ        1.75
Writ        1.66
wrote       1.65
publishes   1.61
observes    1.52
Writing     1.49
Activations Density 0.012%