INDEX
Explanations
specific numerical values mentioned in the text
Negative Logits
Token        Logit
wark         -0.77
ACP          -0.76
Å            -0.73
anu          -0.71
itute        -0.69
Led          -0.69
È            -0.68
owered       -0.68
Identified   -0.67
jured        -0.66
Positive Logits
Token        Logit
blah         1.23
stuff        1.20
assorted     1.00
maybe        0.98
lots         0.92
messing      0.89
everything   0.89
crappy       0.87
goodies      0.86
shenanigans  0.86
Activations Density: 0.326%