Explanations
urban environments and living
Negative Logits
ي  1.22
紝  0.94
ি  0.85
hindi  0.84
ați  0.83
𝚞  0.83
i  0.82
0.80
sidan  0.80
пищи  0.80
Positive Logits
dwellers  1.13
หลวง  0.98
ри  0.90
werke  0.89
隍  0.88
ia  0.87
рија  0.86
slur  0.85
п  0.84
dw  0.83
Activations Density 0.056%