Explanations
phrases that address the audience directly or explicitly include it
Negative Logits
utor          -0.15
人人          -0.14
rts           -0.14
_SPE          -0.14
secondary     -0.13
ä¼ı           -0.13
Surveillance  -0.13
$(            -0.13
oci           -0.13
stride        -0.13
Positive Logits
readers       0.31
reader        0.26
Readers       0.25
unfamiliar    0.24
reader        0.22
Reader        0.22
Reader        0.21
reading       0.20
-reader       0.20
wondering     0.19
Activations Density 0.071%