Explanations
offering help and polite conversation
Negative Logits
==  0.49
These  0.38
=  0.38
meaningless  0.37
केंद्रित  0.37
wget  0.37
downloading  0.37
rotting  0.36
cardinality  0.36
attacks  0.36
Positive Logits
conversar  0.61
politely  0.59
conversación  0.59
courteous  0.56
membantu  0.55
помочь  0.55
respectful  0.55
cheerfully  0.54
আন্তরিক  0.53
help  0.53
Activations Density 1.877%