INDEX
Explanations
sections of text categorized under specific labels or headings
Negative Logits
orsi        -0.76
imer        -0.72
chers       -0.71
ighters     -0.71
uras        -0.69
anza        -0.68
isma        -0.64
ilant       -0.63
asar        -0.63
mine        -0.62
Positive Logits
Unc              0.87
Tags             0.85
Categories       0.83
Articles         0.76
Occupations      0.75
POLIT            0.74
Miscellaneous    0.72
Comments         0.72
Conspiracy       0.71
Discrimination   0.71
Activations Density 0.012%