Explanations
culture wars and social issues
Negative Logits
स्वास्थ्य  0.39
نريد  0.39
crossovers  0.38
meridian  0.38
Prints  0.37
Öffentlichkeit  0.37
verbal  0.36
saúde  0.36
speakers  0.36
bast  0.36
Positive Logits
wars  0.58
guerras  0.55
Wars  0.51
战  0.50
войны  0.50
perang  0.50
guerra  0.49
戰  0.49
Cultural  0.48
war  0.47
Activations Density: 0.005%