Explanations
words related to innovations or advancements in technology and systems
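Negative Logits
(token not shown)   -1.02
(token not shown)   -0.92
(token not shown)   -0.89
•                   -0.87
(token not shown)   -0.85
(token not shown)   -0.84
.                   -0.79
<                   -0.75
(token not shown)   -0.74
→                   -0.74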
Positive Logits
youll     1.82
youre     1.80
theyre    1.75
Thats     1.67
didnt     1.65
doesnt    1.65
Dont      1.61
Dont      1.61
isnt      1.61
wasnt     1.60
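One common way feature dashboards produce lists like the two above is a "logit lens" on the feature: project the feature's decoder direction onto the model's unembedding matrix (W_U) and rank the vocabulary by the result. The sketch below assumes that method; it is not confirmed to be how this page computed its numbers. The function names, vocabulary, and weights are hypothetical toy data, and plain lists stand in for real tensors.

```python
# Minimal sketch (assumption): logit effects of a feature = decoder direction
# projected through the unembedding matrix, one score per vocabulary token.
# All names and numbers below are hypothetical toy values.

def logit_effects(decoder_dir, unembed, vocab):
    """Return (token, score) pairs, one per vocabulary entry.

    decoder_dir: list of d_model floats (the feature's decoder vector)
    unembed: d_model rows, each a list of len(vocab) floats (stand-in for W_U)
    vocab: list of token strings
    """
    scores = []
    for j, tok in enumerate(vocab):
        # Dot product of the decoder direction with column j of the unembedding.
        s = sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        scores.append((tok, s))
    return scores


def top_and_bottom(scores, k):
    """Split scores into the k most positive and k most negative tokens."""
    ranked = sorted(scores, key=lambda pair: pair[1])
    bottom = ranked[:k]                # most negative, like "Negative Logits"
    top = list(reversed(ranked[-k:]))  # most positive, like "Positive Logits"
    return top, bottom


vocab = ["youll", "didnt", ".", "<", "•"]
decoder_dir = [1.0, -0.5]
unembed = [
    [1.0, 0.8, -0.3, -0.2, -0.4],
    [-0.5, -0.6, 0.4, 0.8, 0.5],
]
top, bottom = top_and_bottom(logit_effects(decoder_dir, unembed, vocab), 2)
print(top)
print(bottom)
```

In a real model the loop would be a single matrix-vector product over a vocabulary of tens of thousands of tokens; the explicit loop is kept here only to make the dot product visible.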
Activation density: 0.195%
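Activation density is usually read as the share of sampled token positions on which the feature fires at all. Assuming "fires" means an activation above zero (the page does not state its threshold or sample), 0.195% works out to roughly one token in every 513, since 1 / 0.00195 ≈ 512.8. A minimal sketch of that calculation, with a hypothetical function name and toy activations:

```python
# Minimal sketch (assumption): density = percentage of token positions whose
# feature activation exceeds a threshold (0.0 here). Toy data only.

def activation_density(activations, threshold=0.0):
    """Return the percentage of positions with activation > threshold."""
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# 2 active positions out of 1000 sampled tokens.
acts = [0.0] * 998 + [2.3, 0.7]
print(activation_density(acts))  # 0.2
```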