Explanations
references to geographic locations, specifically focusing on names of cities or countries
references to specific ethnic groups or nationalities
Negative Logits
Token       Logit
Edison      -0.76
arily       -0.71
WARN        -0.65
ister       -0.62
closed      -0.62
Predator    -0.62
envelope    -0.61
angered     -0.60
ODUCT       -0.60
sburg       -0.59
Positive Logits
Token       Logit
istani      0.98
lers        0.97
bones       0.92
ler         0.90
istan       0.87
ling        0.84
oglu        0.84
wei         0.83
mens        0.83
lings       0.83
Activations Density: 0.025%