INDEX
Explanations
the presence of the word "Written" at the beginning of texts or articles
Negative Logits
Ĭ±          -0.85
nel         -0.80
Shinra      -0.78
agara       -0.71
nels        -0.70
ĪĴ          -0.70
Sensor      -0.69
allows      -0.69
illon       -0.67
alon        -0.67
Positive Logits
escription  0.83
written     0.76
aloud       0.76
itatively   0.72
acters      0.72
eloqu       0.70
instrument  0.69
written     0.69
intention   0.69
tongue      0.69
Activations Density: 0.027%