INDEX
Explanations
phrases related to things staying the same or not being affected
terms related to stability or lack of change
Negative Logits
ç«                  -0.68
alez                -0.67
Typhoon             -0.67
aph                 -0.65
RH                  -0.62
McKenna             -0.60
=-=-=-=-=-=-=-=-    -0.60
ingo                -0.60
¯¯                  -0.60
eur                 -0.59
Positive Logits
unchanged           1.24
untouched           0.85
unaffected          0.83
iated               0.80
ishment             0.74
theless             0.73
aneously            0.71
ãĤ´                 0.70
iating              0.68
interpol            0.67
Activations Density 0.006%