INDEX
Explanations
recognize faces or excuse behavior
Negative Logits
}$\    0.44
ላል    0.44
빅    0.44
्ये    0.43
ಞ    0.43
ତା    0.42
들    0.42
稈    0.42
eningen    0.42
言っ    0.41
Positive Logits
ަ    0.45
Bruno    0.43
Hanya    0.42
ص    0.42
stör    0.41
faibles    0.41
这部    0.41
Author    0.41
ق    0.40
Sapir    0.40
Activations Density 0.003%