Explanations
fractions or ratios
phrases indicating proportions or ratios
Negative Logits
irez        -0.61
andals      -0.60
Offline     -0.59
ymm         -0.59
idel        -0.56
isl         -0.55
ourgeois    -0.54
voice       -0.54
hai         -0.54
},"         -0.54
Positive Logits
every       0.90
ten         0.79
100         0.77
tens        0.74
equals      0.70
bounds      0.69
10          0.69
365         0.68
435         0.68
189         0.67
Activations Density: 0.032%