INDEX
Explanations
words related to apologies, and the concept of apology itself
Negative Logits
Token     Logit
leston    -0.18
assen     -0.17
ledge     -0.16
t         -0.15
ussen     -0.15
eenth     -0.15
lest      -0.15
pot       -0.15
ainty     -0.14
hardt     -0.14
Positive Logits
Token     Logit
Ap        0.20
ap        0.19
-ap       0.16
ठन        0.16
rika      0.16
ooled     0.16
emann     0.15
regon     0.15
(ap       0.15
portion   0.15
Activations Density 0.018%
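
The dashboard does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the feature's activation is above zero; the function name, the threshold parameter, and the sample data are all hypothetical:

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumption: "density" means the share of sampled tokens on which the
    feature fires at all; the source does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical sample: 10,000 token positions, 2 of which activate the feature.
sample = [0.0] * 9998 + [1.3, 0.7]
print(f"{activation_density(sample):.3%}")
```

Under that assumption, a density of 0.018% would mean the feature fires on roughly 18 of every 100,000 sampled tokens.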