INDEX
Explanations
lists, conjunctions, and special characters
Negative Logits
drug        0.73
different   0.73
occupation  0.70
phrases     0.70
фей         0.70
distances   0.70
outages     0.69
İlk         0.69
distances   0.68
ordre       0.68
Positive Logits
знают       0.82
abstra      0.80
внимание    0.79
jetzt       0.78
torr        0.75
гражда      0.74
祉          0.74
واحدة       0.73
спасибо     0.72
م           0.72
Activations Density: 0.001%