INDEX
Explanations
phrases expressing fluctuations or changes, particularly ups (positive experiences) and downs (negative experiences)
Negative Logits
lessly     -0.17
296        -0.15
cken       -0.15
deepest    -0.15
Erk        -0.15
Pant       -0.15
ovice      -0.14
opsis      -0.14
ort        -0.14
Grip       -0.14
Positive Logits
/down      0.44
-down      0.25
/up        0.23
datable    0.20
ilon       0.20
scaling    0.19
graded     0.19
most       0.19
ward       0.19
ILON       0.18
Activations Density 0.049%