Explanations
words related to alcoholic beverages and parties
Negative Logits
Token     Logit
iversal   -0.86
aido      -0.82
DAY       -0.81
nesota    -0.80
orate     -0.78
İĭ        -0.78
emade     -0.77
orable    -0.77
omo       -0.77
srf       -0.76
Positive Logits
Token     Logit
aux       1.14
lli       1.08
llo       0.95
lla       0.93
ux        0.80
bourg     0.79
urs       0.78
agne      0.78
du        0.78
Hollande  0.76
Activations Density: 0.009%