INDEX
Explanations
phrases and expressions directed at specific groups of people
Negative Logits
Token       Logit
rts         -0.15
oci         -0.14
irst        -0.14
_SPE        -0.14
utor        -0.14
人人        -0.14
ftime       -0.14
ushima      -0.13
safest      -0.13
nowledge    -0.13
Positive Logits
Token       Logit
readers     0.26
unfamiliar  0.23
wondering   0.22
Readers     0.21
reader      0.21
wonder      0.19
reader      0.19
-reader     0.18
reading     0.18
Wonder      0.18
Activations Density: 0.055%