INDEX
Explanations
mentions of measurement units such as 'g' and 'cm'
instances of a specific character or symbol
Negative Logits
Token        Logit
Polk         -0.71
Salman       -0.70
Virgin       -0.68
Roose        -0.66
Salon        -0.65
Antar        -0.65
Muss         -0.63
Jihad        -0.62
Barbara      -0.62
democracy    -0.61
Positive Logits
Token        Logit
felt          1.15
s             1.12
shed          1.05
ses           1.05
sed           1.05
won           1.03
sure          1.01
should        0.97
ved           0.95
erent         0.95
Activations Density: 0.279%
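
An activation density figure like the one above is usually read as the share of token positions on which the feature fires. The following is a minimal sketch of that calculation, assuming "density" means the fraction of positions whose activation exceeds a threshold. The function name `activation_density`, the `threshold` parameter, and the sample values are hypothetical illustrations, not taken from the dashboard.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Return the fraction of positions whose activation is above `threshold`.

    Assumption: a feature counts as "active" at a position when its
    activation is strictly greater than the threshold (0.0 by default).
    """
    if len(activations) == 0:
        raise ValueError("activations must not be empty")
    active = sum(1 for value in activations if value > threshold)
    return active / len(activations)


# Hypothetical sample: the feature fires on 1 of 1000 token positions.
sample = [0.0] * 999 + [2.3]
print(f"{activation_density(sample):.3%}")  # prints 0.100%
```

Under this reading, a density of 0.279% would correspond to the feature being active on roughly 28 of every 10,000 token positions.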