INDEX
Explanations
percentages or ratios
phrases that indicate action or movement towards a specific goal or outcome
Negative Logits
appropri        -0.76
orno            -0.67
enforcement     -0.65
postage         -0.64
mosa            -0.63
exch            -0.62
enforcement     -0.62
coordination    -0.61
censored        -0.60
censorship      -0.60
Positive Logits
earn            0.96
venge           0.91
finish          0.89
save            0.82
pload           0.81
overcome        0.81
win             0.81
ggles           0.81
celebrate       0.80
clinch          0.77
Activations Density: 0.177%