Explanations
words related to trying out new things and exploring different possibilities
phrases related to experimentation and testing processes
Negative Logits
Token        Logit
fixed        -0.67
cut          -0.64
utra         -0.63
IDS          -0.62
DoS          -0.62
CHAPTER      -0.62
say          -0.61
posted       -0.61
article      -0.60
paragraph    -0.60
Positive Logits
Token              Logit
experimenting      1.00
withd              0.90
ively              0.88
experimented       0.88
experimentation    0.85
experiment         0.82
imental            0.80
iments             0.79
ally               0.79
Experiment         0.76
Activations Density: 0.018%