INDEX
Explanations
phrases indicating simplicity or ease of understanding
Negative Logits
Token     Logit
hips      -0.78
eters     -0.78
reon      -0.75
grave     -0.74
raints    -0.72
mbuds     -0.68
orp       -0.66
orf       -0.64
arians    -0.63
emp       -0.62
Positive Logits
Token       Logit
Jet         0.93
going       0.91
prey        0.80
wallet      0.80
coded       0.78
easy        0.72
accessible  0.70
forgiving   0.70
minded      0.70
ily         0.69
Activations Density: 0.024%