Explanations
dates written in a specific format (month, day, year) combined with specific usernames
commas in the text
Negative Logits
expansions    -0.85
detectors     -0.76
predec        -0.72
successors    -0.68
expansion     -0.67
connections   -0.66
unnecess      -0.66
avorite       -0.65
glim          -0.64
superiors     -0.64
Positive Logits
000           0.95
2017          0.90
2016          0.89
2015          0.88
2018          0.85
05            0.85
2014          0.84
2012          0.82
080           0.81
2010          0.81
Activations Density: 0.051%