Explanations
words related to impactful actions or events
actions that lead to significant consequences or changes
Negative Logits
ug           -0.69
ombat        -0.68
oha          -0.62
=-=-         -0.62
igger        -0.60
ogo          -0.60
û            -0.60
available    -0.59
ique         -0.59
peg          -0.57
Positive Logits
hler           0.65
thereby        0.62
angelo         0.61
arks           0.60
contributions  0.60
cially         0.59
ãĥ¥            0.58
winds          0.58
indu           0.57
compos         0.57
Activations Density: 0.135%